This report describes the results obtained, and the methods and data used in
the development and computer simulation of classification logic for an optical
character reader (OCR). The work was based on that of a previous effort reported
in RADC-TR-75-232.

The  OCR  under  development  is  to  have  the  capability  of  reading  unformatted 
non-OCR  text  such  as  might  be  found  in  foreign  technical  journals.  Thus  the 
classification  logic  has  to  accommodate  a wide  variety  of  print  quality 
degradations found in "uncooperative" environments.

A high  resolution,  self-normalizing,  correlation  scheme  was  evaluated  and 
found  effective  in  handling  certain  types  of  random  noise  and  changes  in 
average  reflectance  and  contrast.  However,  it  was  sensitive  to  systematic 
noise  such  as  character  rotation  and  stroke  thickness  variations. 

The  decision  method  used  was  to  compute  and  choose  the  minimum  Euclidean 
distance  between  the  normalized  character  array  and  the  set  of  normalized 
stored  mask  arrays.  The  masks  were  generated  simply  by  registering  and 
averaging four representative examples for each symbol in each font. The normalization
was such that the mean value and standard deviation of the grey
values for each array were 0 and 1, respectively. In addition, the arrays
were shifted to match centroids.

Ten  fonts  (five  Latin  and  five  Cyrillic)  of  data,  totalling  17,611  characters, 
were  used  in  the  experiments.  Forced  decision  error  rates  (no  rejects  allowed) 
averaged less than 0.5%, with the best being 0% and the worst 1.1%.

Most  of  the  errors  were  among  a small  number  of  characters,  called  confusion 
groups.  Special  logic  was  devised  for  one  confusion  group  that  was  common 
among  Cyrillic  fonts,  and  applied  to  the  font  with  the  worst  error  rate.  The 
error  rate  using  the  special  logic  went  from  1.1%  to  0%. 

In  addition  to  the  classification  logic  design,  a scheme  was  evaluated  for 
character parsing which did not require fixed pitch. Statistics differentiating
text and non-text areas were devised.

The simulation included experiments aimed at providing cost-performance trade-off
data for hardware design considerations. It was found that limiting the
grey scale resolution to 4 bits had no significant effect on performance
after normalization. It was also found that going from 40 to 80 micron spacing
between image samples generally increased the error rate by a factor of two.

The  effect  was  more  pronounced  on  Cyrillic  fonts  that  have  a preponderance 
of  thin,  but  important  strokes,  connecting  more  massive,  but  non-discriminating 
strokes. 

Preliminary  array  processor  logic  was  proposed  based  on  current  LSI  technology 
that  would  implement  the  design  simulated.  A breadboard  using  a minimum 
complement  of  processors  should  now  be  built  to  demonstrate  the  speed  and 
reliability  on  increased  amounts  of  data. 
EVALUATION


This document is the final technical report, based on a 12 month effort,
which summarizes the technical efforts undertaken to develop software
techniques.  These  techniques  cover  the  classification  and  recognition  of 
five  uncooperative  fonts  each  of  Latin  and  Cyrillic  characters  which  appeared 
in  various  scientific  and  technical  journals. 

This effort was a follow-on to a previous contract, F30602-74-C-0309,
"OCR Software Development", which resulted in the evaluation of a cross
correlation  technique.  The  cross  correlation  technique  utilizing  grey 
scale  was  highly  position  sensitive,  but  relatively  insensitive  to  random 
noise  appearing  in  textual  material. 

The  results  of  this  effort  provided  information  that  validated  the 
usefulness  of  grey  scale  in  the  recognition  of  thinly  stroked  Cyrillic 
characters,  but  de-emphasized  the  amount  of  data  per  character  required 
for  accurate  recognition.  The  special  software  logic  that  was  created  to 
recognize  the  Latin  and  Cyrillic  characters  along  with  the  confusion 
characters  in  each  group  achieved  an  error  rate  of  approximately  0.5%  on 
the  unformatted  text.  However,  the  speed  of  recognition  for  an  actual  system 
is far too slow in software. The hardware recommendations in Section 6 of
this report are an initial attempt to demonstrate the advantages of utilizing
microprocessor technology for character recognition and employ parallel
processing  techniques  to  significantly  reduce  the  per  character  processing 
time.  In  a future  OCR  device  that  would  take  advantage  of  these  techniques, 
decisions  must  be  made  regarding  the  cost  effectiveness  of  having  a small 
number  of  errors  vs  the  processing  time  trade  off. 


DENNIS R. NAWOJ
Project  Engineer 


SECTION  1 


INTRODUCTION 


The  past  fifteen  years  have  produced  significant  achievements  in  the 
field  of  Optical  Character  Recognition  (OCR).  Today  there  exist  a number  of 
commercially-available  OCR  systems  which  are  capable  of  reading  a wide  variety 
of type fonts, provided that the input is composed of "high quality" material
prepared in a cooperative environment. In terms of human recognition, the
restrictions implied by the term "high quality" material must be considered
severe. The machine reject rates on degraded material simply do not correspond
to those of the human.

To  some  degree,  the  performance  differential  between  man  and  machine  is 
attributable to the machine's incomplete use of available information. For
reasons  of  economy,  commercially-available  OCR  systems  tend  to  use  relatively 
simplistic  recognition  strategies  involving  only  a few  grey  levels.  In  lieu 
of increasing the recognition computation and therefore the cost, the commercial
manufacturers have chosen to place the burden on the user to produce
"high  quality"  material  at  the  data  source  origin.  This  procedure  can  work 
well  when  the  OCR  system  is  placed  in  a cooperative  environment  where  the 
source  data  are  specially  prepared  for  OCR  processing.  However,  the  scanning 
needs of the Air Force often do not lend themselves to such cooperative environments.
In some OCR applications, absolutely no control over the source
generation  is  possible,  and  yet  the  desire  to  reduce  media  input  costs  still 
exists.  One  such  application  is  the  requirement  to  encode  the  text  of  foreign 
journals  for  the  purpose  of  automatic  translation  and  information  retrieval 
and  analysis.  This  material  is  quite  varied  in  both  quality  and  format,  so 
much  so  that  present-day  commercially-available  OCR  systems  have  been  judged 
inadequate.

This report describes the results obtained on a research project to
evaluate a recognition technique for typeset Latin (English) and Cyrillic
(Russian) alphabetic characters which have been scanned and digitized at
comparatively high spatial resolution and high grey-level quantization by the
LIPS scanner*. A normalized correlation technique was employed as the recognition
algorithm, at the suggestion of the Government. This technique allows
recognition in severe noise conditions such as might occur with poor quality
printing.

Several problems inherent in correlation techniques were addressed.
These problems were demonstrated during work reported in the Final Technical
Report (RADC-TR-75-232) for a related contract (F30602-74-C-0309). Basically,
the solution to the problems required finding the best possible registration
of the scanned character to the prototype mask, so that spurious correlations
with the wrong mask would not lead to incorrect classification, and
preclassifying by size so as to reduce the number of candidate classes per
character. Additional logic was found necessary to resolve certain confusion
groups.

* RADC developed a high-resolution laser scanner system known as the Laser
Image Processing Scanner (LIPS). The LIPS system is capable of scanning,
digitizing, and storing on magnetic tape, source material with each resolvable
point represented by one of 256 grey levels, at sample spot size and
spacings from 1.25 to 40 microns.

A correlation technique was applied as the character recognition algorithm
for five Latin and five Cyrillic fonts, using data with high spatial and
grey-scale  resolution.  A data  base  of  characters  from  the  above  mentioned 
fonts  was  collected  from  selected  Russian  technical  journals  and  digitized 
using  the  LIPS  scanner  at  a spot  size  of  40  microns,  using  256  grey  levels. 
These  data  were  chosen  to  be  most  representative  given  the  constraints  imposed 
by  time  and  scanning  and  program  processing  rates.  Initial  investigations 
revealed  that  the  data  could  be  reduced  by  simple  truncation  to  64  grey  levels 
without  any  influence  upon  the  results.  This  could  be  done  to  reduce  overall 
storage  needs. 

A character isolation algorithm was also developed to segment the digitized
images first into lines, then into words, and then into individual characters.
The isolated characters were then identified.

Several examples of each class were selected to form prototype masks
while the remainder were set aside to form a test set. After some preprocessing,
designed to enhance the overall system performance, correlation was
performed. Statistics on the correlation coefficients as a function of
character class and font were obtained to measure the effectiveness of the
correlation technique. Font-by-font statistics were gathered to determine
both character class-pair problems and font characteristics which cause
problems.

Over  seventeen  thousand  characters  were  isolated  and  classified  in  this 
study.  The  overall  accuracy  rate  was  better  than  99.5%.  The  results  were 
about as expected. The additional resolution allowed a high level of discrimination
for perfectly-aligned, well formed character-mask pairs. In other
words, the character-mask variance* was generally quite small for a test
character and the corresponding stored mask, several times smaller than the
minimum variance between a character and most other masks.

Correlation  is  an  area-sensitive  technique.  Thus  any  degradation  that 
alters  the  area  of  overlap  between  a character  and  its  correct  mask  will 
seriously degrade their correlation. Random noise does not have as strong an
effect on performance as does a rigid translation or rotation, until the noise
is of such magnitude as to affect the same percentage of picture elements
(pixels) as would the misregistration. Thus a registration algorithm was
required. The scheme employed matched centroids and then made four small x
and y displacements, computing a maximum of nine correlations per character-mask
pair.


* As measured by the Euclidean distance between normalized character and
mask vectors.


Additional  recommendations  for  modification  of  the  recognition  scheme 
were  made  to  alleviate  other  problems  generally  associated  with  correlation. 
Initially  it  was  thought  necessary  to  include  in  the  variance  calculation  a 
weighting  vector  which  gives  emphasis  to  those  areas  that  are  critical  for 
discriminating between similar characters. This is especially true for certain
Cyrillic characters which differ only in a small fraction of the total
area,  usually  in  the  thin  connecting  strokes.  Thus  an  alternative  method  was 
tried  which  proved  successful  on  a small  sample  of  confusion  classes.  A 
scheme  was  tried  that  weighted  areas  of  difference  between  masks.  However, 
unwanted  differences  - usually  due  to  variation  in  stroke  width  - were  often 
as  large,  in  area,  as  the  important  areas  of  difference. 

Section 2 describes the systems, procedures, and programs used in obtaining
data, masks, and processed results.

Section  3 describes  the  algorithms  and  procedures  in  greater  detail. 

Section  4 details  the  experimental  results. 

Section  5 further  discusses  suggested  embellishments  to  the  basic  scheme. 

Section 6 describes the effort to design test hardware for the classification
logic.

Section 7 summarizes the current work and outlines suggested future work.

Appendix  A provides  the  software  documentation. 

Appendix  B provides  the  experimental  design  timing  estimates  of  a 
single  processor  classification  scheme. 

Appendix C includes pages of text extracted from each of the Russian
journals  used. 
SECTION 2


PROCEDURE 


The  systems,  programs,  and  procedures  used  in  the  operational  steps  of 
the  project  are  discussed  in  this  section.  The  data  were  scanned  on  the  LIPS 
system. For most of the project, the scanned data had to be copied from the
nine-track tapes output by LIPS onto seven-track tapes for general use on the
HIS-6180 MULTICS system. Several programs were developed on the MULTICS
system for evaluating the correlation algorithms. Some of the image data processed
were taken from data scanned under contract F30602-74-C-0309. As
needed,  additional  images  were  scanned  on  LIPS. 

The  LIPS  tape  images  were  then  stored  as  disk  images  by  the  MULTICS 
program  OCR. 

The  disk  files  were  then  processed  by  the  MULTICS  program  DOCR*  which 
isolated  the  individual  characters  in  the  images.  These  characters  were  then 
displayed  in  groups  on  a CRT  screen  and  hardcopies  of  the  displays  were  made 
so  that  the  characters  could  be  visually  identified.  The  character  identities 
were  then  recorded  and  typed  as  ID  files  for  use  by  a program  which  selects 
characters  for  building  masks  and  another  program  which  merges  the  ID  file 
with a character file to form an edited file of labelled characters. Some
characters, such as italics, bold face and mathematical symbols, were specifically
labelled to be ignored by the recognition algorithm.

After all of the characters of a font are identified, the program MASK_SELECT
randomly selects patterns to be used as masks and outputs the file
names and indices of the patterns for use by COLLECT_PATTERNS. COLLECT_PATTERNS
copies selected patterns from a set of files output by DOCR and
writes them into another file. In this case it is used to collect four
samples of each pattern into a file called MASK_SET. The program MASK_GENERATOR
is then used to input groups of patterns, perform whatever preparation
may be necessary, register them to each other, compute the "average"
of the patterns, and output the resulting patterns as masks.

The entire collection of masks is then input to TRANSFORMER. TRANSFORMER
then computes the grey-level normalized and truncated version of each mask and
outputs it to a MASK_DIRECTORY, along with a set of pointers sufficient for
CORREL_MAIN, the central correlation and classification routine, to find each
mask  when  needed.  TRANSFORMER  is  also  applied  to  each  of  the  pattern  files 
output  by  DOCR;  it  converts  each  of  these  patterns  to  grey-level  normalized 
and  truncated  form. 

Either  as  DOCR  files  or  as  TRANSFORMER  files,  the  pattern  files  are 
merged  with  the  ID  files  so  that  CORREL_MAIN  may  evaluate  the  validity  of  its 
decision.

* For clarity, program names will be presented in upper case letters. In
Appendix A, as is customary in MULTICS, they are in lower case letters.


The files of grey-level normalized, truncated, and labelled patterns are
then input to CORREL_MAIN for processing with the several correlation-related
algorithms. Because there are several files of patterns per font, the outputs
of each run of CORREL_MAIN are merged and later processed by SUMMARIZE which
prints out summary error and correct recognition rates. They may also be
processed by SUMMARIZE_TRADE_OFFS which prints out sets of error, reject, and
correct recognition rates for various confidence levels.


SECTION  3 


ALGORITHMS 


3.1.  PREVIOUSLY IDENTIFIED PROBLEMS

In  the  preceding  report  [1],  some  problems  were  identified  as  being 
directly  related  to  the  use  of  correlation.  The  algorithms  which  applied  the 
correlation techniques first addressed these problems. The problems identified
were: (1) how to deal with the fact that some characters are included
in others as parts, and therefore correlate well with those parts of the
other characters; (2) how to register mask and character in order to obtain
an optimum correlation; and (3) how to increase the sensitivity of correlation
to particular, high-information parts of characters while reducing
sensitivity to others.

The  "inclusion  problem"  is  solved  by  defining  each  actual  pattern 
array  to  exist  within  a larger  virtual  pattern  array.  Thus,  when  a character 
pattern  is  isolated  from  its  surrounding  text,  a larger,  virtual  pattern  is 
defined  as  in  Figure  3-1  with  some  fixed  intensity  value  defined  to  be  the 
additional  background.  The  size  of  the  "virtual"  pattern  is  chosen  to  be  a 
little  larger  than  necessary  to  enclose  the  largest  mask  in  the  system. 

This  does  not  have  a significant  effect  on  the  total  amount  of  computation 
necessary  to  compute  the  correlation.  In  this  manner,  the  correlation 
algorithm  is  forced  to  compute  the  variance  between  both  the  included  and 
non-included  parts  of  test  characters  and  the  masks.  Figure  3-2  contrasts 
the  use  of  the  virtual  array  technique  to  the  earlier  work  in  which  inclusion 
was  a problem.  Size  normalization  would  produce  a similar  result.  However, 
size normalization requires additional computation for both the normalization
and  correlation  processes,  while  the  "virtual  pattern"  technique  requires 
little  additional  computation. 

The registration problem is essentially solved by registering the centroids
(centers of mass) of the character and each mask. Observation of a set
of correlations obtained by correlation of 300 well-formed characters to their
own masks has shown that optimum correlation is always obtained when the
character center of gravity is within ±1 pixel in either direction of the
mask centroid.
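The virtual-array placement of Figure 3-1 and the centroid registration described above can be sketched together as follows. This is a minimal illustration, not the report's program: the function names, the square virtual array, and the choice of 0 as the fixed background value are all assumptions.

```python
def centroid(pattern):
    """Return the intensity-weighted center of mass (row, col) of a 2-D list."""
    total = sum(sum(row) for row in pattern)
    if total == 0:
        raise ValueError("pattern has no non-zero pixels")
    r = sum(i * v for i, row in enumerate(pattern) for v in row) / total
    c = sum(j * v for row in pattern for j, v in enumerate(row)) / total
    return r, c


def place_in_virtual_array(pattern, size, background=0):
    """Copy `pattern` into a size x size array with its centroid at the center.

    `size` and `background` are hypothetical parameters; the report only says
    the virtual array is a little larger than the largest mask.
    """
    r, c = centroid(pattern)
    center = size // 2
    top = center - round(r)
    left = center - round(c)
    rows, cols = len(pattern), len(pattern[0])
    if top < 0 or left < 0 or top + rows > size or left + cols > size:
        raise ValueError("virtual array too small for this pattern")
    virtual = [[background] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, v in enumerate(row):
            virtual[top + i][left + j] = v
    return virtual
```

Because every character and every mask is placed in an array of the same size, a narrow character compared against a wide mask is penalized for the mask strokes it does not cover, which is the point of the virtual pattern technique.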


[...] Incorrect classifications are often made unless the best registration
is obtained. As a first approximation to optimum registration, the
centroids of the character and mask were registered before the correlation
[...]. For a set of 300 characters we computed all correlations
[...] of the characters to their own (correct) mask.
[...] characters was registered to each of the points
[...] direction of the mask centroid and each of the 25
[...] was performed. In all of the cases, shifting
[...] the optimum correlation. Often, no shifting was
[...], although a shift would produce the optimum
[...] correlation to one mask is so much
better than any correlation to any other mask that a decision can safely and
correctly be made without additional computation to find the optimum correlation.
Certain characters, however, still require precise registration.

Figure 3-1  The Placement of an Isolated Pattern into a Virtual Array
(labels: Isolated Pattern, Virtual Pattern Array)

Figure 3-2
i)  When the correlation is performed "in-text", inclusion results
ii)  When character and mask are centered in a larger, virtual array,
the inclusion problem is resolved


An  algorithm  was  designed  to  increase  the  sensitivity  of  correlation  to 
particular, high-information parts of characters. This algorithm will be
discussed more thoroughly in Section 3.6. The following relates to algorithms
used in this effort. Flow charts in Section 3.10 summarize the overall experimental
recognition procedures and the recognition logic flow.

3.2.  TEXT  FIND 

To separate text from non-text, some statistics of various text and non-text
images were studied. The statistics studied were entropy, average
intensity  and  average  length  of  black/white  segments  in  the  binary  quantized 
version  of  the  image.  It  was  observed  that  each  of  the  three  statistics  is 
reasonably  consistent  over  any  segment  of  either  text  or  non-text.  However, 
for  any  transition  between  text  and  non-text,  at  least  one  or  possibly  all  of 
the  statistics  shows  a significant  change.  The  particular  statistic(s)  which 
change  depends  upon  the  specific  nature  of  the  non-text  and  text.  No  attempt 
was  made  to  automate  the  process  of  separating  text  from  non-text. 
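The three statistics named above can be computed for an image region roughly as follows. This is a sketch under assumptions: the report does not give its exact definitions, so the entropy here is taken over the grey-level histogram, the run lengths are measured along rows, and the function names and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def entropy(pixels):
    """Shannon entropy in bits of the grey-level distribution of a flat list."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def mean_run_length(rows, threshold):
    """Average length of black/white runs along rows after binarizing."""
    runs = []
    for row in rows:  # rows are assumed non-empty
        bits = [v >= threshold for v in row]
        length = 1
        for prev, cur in zip(bits, bits[1:]):
            if cur == prev:
                length += 1
            else:
                runs.append(length)
                length = 1
        runs.append(length)
    return sum(runs) / len(runs)


def region_statistics(rows, threshold):
    """Collect the three statistics for one rectangular region."""
    flat = [v for row in rows for v in row]
    return {
        "entropy": entropy(flat),
        "mean_intensity": sum(flat) / len(flat),
        "mean_run_length": mean_run_length(rows, threshold),
    }
```

Comparing these values between adjacent regions would expose the transitions the text describes, although the report made no attempt to automate that step.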

3.3.  PARSING  AND  CHARACTER  FIND 

The  automatic  method  for  isolating  characters  in  text  involves  the  three 
stages of line find, word find, and character find. It is assumed that correction
of page skew could be performed at this point in the processing.

The  character  extraction  is  performed  by  the  program  DOCR.*  Various 
display  functions  are  optionally  provided  by  subroutines. 

Initially, a histogram of pixel intensities is made. Figure 3-3 is a
sample data chip as digitized by LIPS and Figure 3-4 is the grey-level
histogram of the chip. From this distribution, an approximate threshold
value for determining "character" vs. "background" can be selected. Figure
3-4 is composed of two distributions: the first is the background grey-levels
and the second is the character grey-levels. The image is thresholded
at the approximate point where the background distribution ends, with every
pixel less than the threshold value set to 0. This thresholding eases the
character extraction process.

A frequency  count  of  the  number  of  non-zero  pixels  in  each  row  of  the 
image  is  obtained.  Even  though  there  are  several  characters  which  rise 
above  the  height  of  the  small  lower  case  characters  or  drop  below  the  line, 
this  distribution  is  characterized  by  very  sharp  jumps  that  indicate  the 
location  of  the  main  body  of  text  lines  with  each  line  generally  being 
separated  by  at  least  five  pixels.  Figure  3-5  is  a histogram  of  the  number 
of  non-zero  pixels  in  each  row  of  the  text  shown  in  Figure  3-3. 
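The thresholding and line-find steps can be sketched as below. The names and the `min_count` parameter are illustrative assumptions, and the rule of closing a line after five consecutive empty rows is one reading of the observation that lines are generally separated by at least five pixels.

```python
def threshold_image(image, threshold):
    """Set every pixel below `threshold` to 0 (background); keep the rest."""
    return [[v if v >= threshold else 0 for v in row] for row in image]


def row_counts(image):
    """Number of non-zero pixels in each row of the image."""
    return [sum(1 for v in row if v != 0) for row in image]


def find_lines(counts, min_count=1, min_gap=5):
    """Return (start, stop) row ranges, stop exclusive, for each text line.

    A line is closed only after `min_gap` consecutive rows fall below
    `min_count`, so short gaps inside a line do not split it.
    """
    lines = []
    start = None
    gap = 0
    for i, c in enumerate(counts):
        if c >= min_count:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                lines.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        lines.append((start, len(counts) - gap))
    return lines
```

Raising `min_count` above 1 would restrict each range to the main body of the line, ignoring the sparse rows that contain only ascenders and descenders.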

* For clarity, program names will be presented in upper case letters. In
Appendix A, as is customary with MULTICS, they are shown in lower case letters.
Figure 3-3  LIPS Image


Figure  3-4  Grey-Level  Distribution  of  Figure  3-3 
Figure 3-5  Horizontal Line Distribution of Figure 3-3


The problem of correcting page skew was not a concern in this effort;
however, it could be most efficiently recognized and corrected at this point in
the total classification procedure. Improper skew correction would be indicated
by the histogram of non-zero pixels not being sharply defined. Correction
of skew could be obtained for an entire page of characters and no modifications
of later algorithms would have to be performed to account for skew.

After separating lines of text, vertical or column histograms of the
number of non-zero pixels are computed for each line. Figure 3-6 presents a
histogram of the intensity values of line 2 of Figure 3-3. The start of
a word is defined as that point in a line of text where there are at least 12
of 15 non-zero elements in that line's column histogram. Similarly, a word
stop is that point where at least 12 of 15 elements are zero in the line's
column histogram. When the word's boundaries are determined, the character
extraction occurs.
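The 12-of-15 rule can be sketched with a sliding window over the column histogram. The report does not say how the 15 elements are positioned relative to the start or stop point, so this sketch assumes a window that looks ahead from each column and then trims the detected span to its first and last non-zero columns. All names are hypothetical.

```python
def _trim(col_hist, lo, hi):
    """Shrink [lo, hi) so that it begins and ends on a non-zero column."""
    hi = min(hi, len(col_hist))
    while lo < hi and col_hist[lo] == 0:
        lo += 1
    while hi > lo and col_hist[hi - 1] == 0:
        hi -= 1
    return lo, hi


def find_words(col_hist, window=15, needed=12):
    """Return (start, stop) column ranges, stop exclusive, for each word."""
    words = []
    start = None
    for i in range(len(col_hist) - window + 1):
        nonzero = sum(1 for v in col_hist[i:i + window] if v != 0)
        if start is None and nonzero >= needed:
            start = i  # at least 12 of the next 15 columns contain ink
        elif start is not None and window - nonzero >= needed:
            # at least 12 of the next 15 columns are empty: the word has ended
            words.append(_trim(col_hist, start, i + window))
            start = None
    if start is not None:
        words.append(_trim(col_hist, start, len(col_hist)))
    return words
```

The majority vote makes the rule tolerant of the narrow gaps between characters inside a word, which are much shorter than 12 columns, while still detecting the wider gaps between words.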

The average character, in the fonts that this report has studied, is less
than 90 pixels in width. Using this information, DOCR examines the column
histogram elements of each word and locates a minimum about every 90 pixels.
The distance between the start of a word and the first minimum is the width of
the first character. The distance between the first and second minima is
the width of the second character, etc. A row histogram is then computed over
the width of each character to determine the height of the character. Because
of character overlap, the minimum histogram element is searched for instead of
a zero histogram element.
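A possible reading of "a minimum about every 90 pixels" is sketched below: starting from the previous cut, the column with the smallest count is sought in a window centered one pitch further on. The `tolerance` parameter that sets the window width is an assumption, since the report does not state how far from the nominal position DOCR searches.

```python
def split_characters(col_hist, pitch=90, tolerance=20):
    """Return cut columns for one word: its start, each minimum, and its end.

    `tolerance` is a hypothetical search half-width and is assumed to be
    smaller than `pitch`, so every cut lies beyond the previous one.
    """
    cuts = [0]
    n = len(col_hist)
    while n - cuts[-1] > pitch + tolerance:
        lo = cuts[-1] + pitch - tolerance
        hi = cuts[-1] + pitch + tolerance
        # choose the column with the smallest ink count in the search window
        best = min(range(lo, hi + 1), key=lambda k: col_hist[k])
        cuts.append(best)
    cuts.append(n)
    return cuts
```

Consecutive differences of the returned list give the character widths. Searching for the smallest element rather than an empty column is what lets the cut fall between touching or overlapping characters.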

3.4.  PREPROCESSING

After  the  characters  are  isolated  a few  preprocessing  operations  must  be 
performed.  The  purpose  of  the  preprocessing  is  to  prepare  the  character 
images  for  the  correlation  programs. 

Two  of  the  preprocessing  functions  are  designed  to  enhance  the  quality  of 
the images and thereby improve the performance of the OCR system. The functions,
SPOT-REMOVER and NOISE-CLIPPER, remove some "salt and pepper" background
noise from the character patterns. The removal is performed by setting
the pixels to the background fixed value.

SPOT-REMOVER  scans  the  character  pattern  to  find  isolated  spots  of  small 
density  and  removes  them.  It  was  sufficient  to  define  "isolated"  to  mean  that 
the  spot  should  be  separated  from  the  primary  image  by  at  least  one  empty  row 
(or  column)  of  background.  It  was  also  sufficient  to  define  as  "small"  any 
spot  whose  black  pixels  comprised  fewer  than  4%  of  the  black  pixels  of  the 
image.

NOISE-CLIPPER removes isolated black pixels. For this purpose a point is
isolated if it has fewer than two neighbors in a four-connected sense. This
function serves to improve some of the imagery since it does smooth out edges.
However, it also deletes some very fine connectors in poor images and some
fine, thin serifs in a few fonts. It may be necessary to bypass this option
for certain fonts in certain text sources.
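The NOISE-CLIPPER rule can be sketched as a single pass over the pattern. The function name, the single-pass behavior, and the use of 0 as the background value are assumptions made for illustration.

```python
def noise_clip(image, background=0):
    """Set to background every black pixel with fewer than two 4-neighbors.

    Neighbors are counted on the input image, so the result does not depend
    on the order in which pixels are visited.
    """
    rows, cols = len(image), len(image[0])

    def is_black(i, j):
        return 0 <= i < rows and 0 <= j < cols and image[i][j] != background

    out = [row[:] for row in image]
    for i in range(rows):
        for j in range(cols):
            if image[i][j] != background:
                neighbors = sum(
                    is_black(i + di, j + dj)
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                )
                if neighbors < 2:
                    out[i][j] = background
    return out
```

The same rule also removes the end pixel of any stroke that is one pixel wide, which is consistent with the loss of fine connectors and thin serifs noted above.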

3.5.  INITIAL  REGISTRATION 

The character's center of mass is then computed and the pattern is transposed
so that its center of mass is at the center of the pattern. The center
of mass point (\bar{x}, \bar{y}) is computed as:

\[
\bar{x} = \frac{\sum_{x,y} x\, f(x,y)}{\sum_{x,y} f(x,y)}, \qquad
\bar{y} = \frac{\sum_{x,y} y\, f(x,y)}{\sum_{x,y} f(x,y)} \tag{1}
\]

where the sums are taken over each point (x,y) within the tightest boundary
around the pattern and f(x,y) is the intensity value at (x,y). IBM has
reported that normalization by first and second moments was more successful
for character recognition than other normalization techniques (such as size
normalization) (IBM Research Report 140-68). Our centroid normalization is a
normalization by first order moment. The characters are all of the same font
so second order moment normalization is unnecessary.

3.6.  GREY-LEVEL NORMALIZATION

This procedure converts the 6-bit, centered, and noise-detected pattern
to a grey-level normalized, truncated, and thresholded pattern of n bits. In
this operation 4 bits were used. The following correlation formula was used:

\[
c = \frac{1}{N} \sum \,[\ldots] \tag{2}
\]

The normalized grey-level variable

\[
\hat{x} = \frac{x - \bar{x}}{\sigma_x}
\]

is first computed for each test character pixel, where \bar{x} and \sigma_x
are the mean and standard deviation of the grey values x of the pattern. The
characteristics of the variable \hat{x} are that it has mean 0
and standard deviation 1. More important is that its range is limited.

At this point the variate \hat{x} is a real variable which requires more than
6 bits for representation. Since the goal of this effort is to evaluate the
performance of the technique on a hardware OCR device, \hat{x} is
scaled and truncated by applying the transformation

\[
\tilde{x} = 2^{m} \cdot
\begin{cases}
\min(\hat{x},\ 1.99) & \text{if } \hat{x} \ge 0 \\
\max(\hat{x},\ -1.99) & \text{if } \hat{x} < 0
\end{cases}
\]

where m is the desired number of bits to be maintained from the right of the
binary (or radix) point. Thenceforward, the value \tilde{x} is treated as an integer.
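The normalization and truncation can be sketched as follows. Several choices here are assumptions and are not stated in the report: the population standard deviation is used, the default m = 2 is chosen because it yields integers between -7 and 7, which fit a signed 4-bit value, and the function name is hypothetical.

```python
import math


def normalize_and_quantize(pixels, m=2, limit=1.99):
    """Normalize grey values to mean 0 and deviation 1, clip, scale, truncate."""
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)
    if std == 0:
        raise ValueError("a flat pattern cannot be normalized")
    out = []
    for p in pixels:
        z = (p - mean) / std
        # limit the range to +/-1.99 before scaling by 2**m
        z = min(z, limit) if z >= 0 else max(z, -limit)
        out.append(int(z * 2 ** m))  # int() truncates toward zero
    return out
```

Because each pattern is normalized by its own mean and deviation, a uniform change in reflectance or contrast leaves the result unchanged, which is the self-normalizing property the report relies on.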

3.7.  CORRELATION

The  correlation  routines  are  a series  of  subprograms  designed  to  de- 
termine the  identity  of  a character  by  comparing  the  character  pattern  array 
to  a set  of  masks.  The  comparison  process  requires  a maximum  of  three  passes. 

In  the  first  pass  the  pattern  is  compared  to  all  masks  which  meet  some  crite- 
rion of  plausibility.  We  have  used  relative  size  as  the  criterion.  If  the 
pattern  is  of  size  m x n,  (m  rows  and  n columns),  each  mask  which  is  to  be 

compared  to  the  character  must  be  of  size  p x q where  m'  p m"  and  n'  ^ qf. 

n" . Determination  of  the  functions  which  determine  the  relationships  between 
m',  m",  and  m and  n',  n",  and  n is  font-dependent.  If  the  serifs  of  the  font 

are  large  and  not  likely  to  disappear  because  of  poor  ink  quality  or  preproc- 

essing, then  m'  = .86m,  m"  = 1.14m,  n'  = .86n,  and  n"  = 1.14n  were  found  to  be 
adequate.  These  parameters  are  sufficient  to  account  for  variations  in  inking, 
quantization  error,  and  minor  errors  by  the  isolation  process. 

If  the  font  has  small,  thin  serifs  which  occasionally  disappear,  it  is 
necessary  to  modify  the  definition  of  n"  slightly.  The  following  is  an 
example of the modification which had to be used:

    n" = 1.14 × n;
    if n" < n + 10 then n" = n + 10.

An  example  of  the  case  in  which  this  is  necessary  is  shown  in  Figure  3-7. 
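
The size criterion can be illustrated with a short Python sketch. The names `size_bounds` and `plausible_masks`, the `thin_serifs` flag, and the dictionary of mask sizes are hypothetical; only the factors 0.86 and 1.14 and the n + 10 rule come from the text.

```python
def size_bounds(m, n, thin_serifs=False):
    """Return ((m', m"), (n', n")) for a pattern of m rows and n columns."""
    m_lo, m_hi = 0.86 * m, 1.14 * m
    n_lo, n_hi = 0.86 * n, 1.14 * n
    # Fonts whose thin serifs may vanish get a wider upper column bound.
    if thin_serifs and n_hi < n + 10:
        n_hi = n + 10
    return (m_lo, m_hi), (n_lo, n_hi)


def plausible_masks(pattern_size, mask_sizes, thin_serifs=False):
    """List the labels of masks whose size p x q passes the criterion."""
    m, n = pattern_size
    (m_lo, m_hi), (n_lo, n_hi) = size_bounds(m, n, thin_serifs)
    return [label for label, (p, q) in mask_sizes.items()
            if m_lo <= p <= m_hi and n_lo <= q <= n_hi]
```

For a 50 x 40 pattern, a 50 x 48 mask is rejected by the plain rule (48 > 45.6) but admitted when the thin-serif rule raises n" to 50.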

After the first pass, in which correlation is performed against all masks
in the proper size range, a list of the most likely identities and their
correlations with the input character is obtained. The list is then sorted so
that the choice with the best correlation is first.

Depending  on  the  system's  confidence  in  the  best  choice  as  the  proper 
decision,  the  system  either  accepts  it  as  the  decision  or  goes  on  to  the  next 
phase  of  processing. 

The criterion for confidence in the decision is the ratio between the
second best and the best correlations. For some fonts it was found that a
ratio of less than 1.6 was sufficient; for others a ratio of 2.5 was
required. It must be pointed out that the smaller the ratio that is used, the
greater the probability of accepting an incorrect decision.

In the second phase of processing, the character is compared to the set
of masks whose correlation (from Phase 1) is less than p times the best
correlation obtained. This procedure shifts the mask ±1 pixel in each
direction and correlates the shifted mask to the character. For each mask the
best of the eight correlations thus obtained and the correlation obtained in
Phase 1 is used as the character-to-mask correlation output by this phase. A
faster version would use an approximate scheme requiring only four shifts.

Figure 3-7 An example of "trimmed" or "lost" serifs.

Again, at the end of Phase 2 a determination is made whether to make the
decision based on the best correlation or to go on to the third phase.

Before  entering  Phase  3,  the  list  of  candidate  choices  is  again  shortened 
in  a manner  similar  to  that  used  at  the  beginning  of  Phase  2. 
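
A minimal Python sketch of the Phase 2 re-registration follows. It treats "correlation" as the mean squared difference (smaller is better) and assumes that pixels vacated by a shift are filled with a background value `fill`; the report does not say how the borders were handled, and the function names are hypothetical.

```python
def mean_sq_dist(a, b):
    """Mean squared difference between two equally sized 2-D arrays."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) / n


def shift(mask, dr, dc, fill=0):
    """Shift a 2-D array by (dr, dc); vacated cells receive `fill` (assumed)."""
    rows, cols = len(mask), len(mask[0])
    out = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                out[rr][cc] = mask[r][c]
    return out


def best_shifted_distance(pattern, mask, fill=0):
    """Best distance over the unshifted mask and its eight +/-1 pixel shifts."""
    return min(mean_sq_dist(pattern, shift(mask, dr, dc, fill))
               for dr in (-1, 0, 1)
               for dc in (-1, 0, 1))
```

The four-shift variant mentioned in the text would restrict the loop to the horizontal and vertical neighbours only.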

3.8.  WEIGHTED  CORRELATION 

In Phase 3, a series of pairwise comparisons is made. A pairwise
comparison is performed for each pair of candidate choices remaining. Thus, if
the remaining choices are c, e, and o, three pairwise comparisons are made: c
vs. e, c vs. o, and e vs. o. Then the candidate with the most votes (a
candidate receives a vote if it wins a comparison) is selected as the decision.

The comparison technique uses a "weighted" mask which is computed as the
difference of two masks and then incorporated into the correlation formula.

Figure 3-8 graphically illustrates the strategy employed.

What  the  technique  attempts  to  do  is  to  build  a specialized  mask  which 
represents  the  difference  of  the  two  masks  being  compared.  Then,  only  those 
character-to-mask  differences  which  occur  in  the  non-zero  portions  of  the 
weighted  mask  contribute  to  the  correlation  sum. 
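
The voting scheme can be sketched as follows. The weight w_i = |m_i - m_i'| follows the cost analysis later in the report; the function names, the flattened list representation of the arrays, and the tie handling (the first mask of a pair wins an exact tie) are assumptions made for illustration.

```python
from itertools import combinations


def weighted_distance(x, m, w):
    """Weighted mean squared difference between pattern x and mask m."""
    return sum(wi * (xi - mi) ** 2 for xi, mi, wi in zip(x, m, w)) / len(x)


def pairwise_vote(x, masks):
    """Run every pairwise comparison and return (winner, votes).

    `masks` maps a label to a flattened mask of the same length as x.
    """
    votes = {label: 0 for label in masks}
    for a, b in combinations(masks, 2):
        # Only positions where the two masks differ get non-zero weight.
        w = [abs(p - q) for p, q in zip(masks[a], masks[b])]
        d_a = weighted_distance(x, masks[a], w)
        d_b = weighted_distance(x, masks[b], w)
        votes[a if d_a <= d_b else b] += 1
    winner = max(votes, key=votes.get)
    return winner, votes
```

With toy four-pixel masks for c, e, and o and an input equal to the e mask, e wins both of its comparisons and is selected.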

3.9.  MASK  GENERATION 

For the previous contract, masks were generated by manually co-registering
and averaging together characters using the DICIFER system. The method is
both tedious and prone to operational error. Most of the masks generated for
the  work  under  this  contract  were  generated  by  automatic  registration  and 
averaging.  The  automatic  registration  was  performed  by  matching  the  centroids 
of  the  characters  to  each  other  and  then  computing  the  average.  The  masks 
were  then  subjected  to  the  same  preprocessing  transformations  which  are 
applied  to  the  test  characters  and  stored  in  a mask  directory  for  quick 
reference  by  the  correlation  program. 

3.10.  SUMMARY 

The  flow  chart  in  Figure  3-9  summarizes  the  overall  experimental 
recognition  procedure  and  Figure  3-10  describes  the  recognition  logic 
flow. 


Figure 3-9 Summary of Experimental Procedure to Test Classification Algorithms

Figure 3-10 Summary of Classification Logic Flow

Flow chart steps: Normalize Grey Levels in Character Pattern; Remove Noise
and Spots; Shift on Centroid in Virtual Array; Select Mask Set to be used in
Stage I Correlation; Perform Stage I Correlation; Select Masks to be used in
Stage II Correlation; Perform Stage II Correlation.


SECTION  4 

EXPERIMENTAL  RESULTS 


Experiments were conducted to evaluate the hardware simulation algorithms
for recognition of lower case characters from 5 Latin fonts and 5 Cyrillic
fonts. Error rates generally varied from 0.0 to 1.0%. For some fonts the
experiments were simply a matter of running the programs which had already
been developed; for others, it was an iterative process of determining
where and why errors occurred, changing the mask set to compensate, and
rerunning the programs.

In this section, the recognition/error rates for each of the fonts are
presented as well as a discussion of these errors. The results of
investigations into determination of text/non-text data, error vs. rejection
trade-offs, and effects of each segment of the three-pass classifier will be
presented.

4.1.  TEXT  VS.  NON-TEXT 

To  separate  text  from  non-text,  some  statistics  of  various  text  and 
non-text  images  were  studied.  The  statistics  studied  were  entropy,  average 
intensity,  and  average  length  of  black/white  segments  in  the  binary  quantized 
version  of  the  image. 

Given an M by N image f(i,j), 1 ≤ f(i,j) ≤ K, the entropy E, a measure
of grey-level randomness, is defined as

    E = -\sum_{k=1}^{K} P(k) \log P(k)

where P(k) = (No. pixels of intensity k)/MN.

The average intensity \bar{f} is defined as

    \bar{f} = \sum_{i,j=1}^{M,N} f(i,j) / MN

and the average segment length \bar{s} is defined as

    \bar{s} = MN / N_s

where N_s is the total number of segments in the image.

A "segment" of length s is a 1 by s subimage in which either all f(i,j) ≥ T
or all f(i,j) < T for a given threshold intensity T.

In our experiment, M=1, N=500, and T=\bar{f}. That is, an image is simply a
scan line of 500 pixels. 100 consecutive scan lines were used to examine
the behavior of these statistics. Figure 4-1 shows such curves from a text
image and Figure 4-2 shows those from a non-text image.

Note that each of the three statistics behaves reasonably consistently
over any image of either text or non-text. However, significant changes
occur within the transition area. Whether any of these three statistics is
redundant and whether these statistics are sufficient to distinguish text
from various non-text content require further investigation.
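
The three statistics are simple to compute for a single scan line (M = 1). The sketch below is illustrative only: the function name is hypothetical, and the natural logarithm is an assumption, since the report does not give the base of the logarithm.

```python
import math
from collections import Counter


def scanline_stats(line):
    """Return (entropy, average intensity, average segment length)
    for one scan line, using the line's own mean as threshold T."""
    n = len(line)
    counts = Counter(line)
    # Entropy of the grey-level histogram (natural log is an assumption).
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    mean = sum(line) / n
    # A new segment starts whenever the pixel crosses the threshold.
    segments = 1
    for prev, cur in zip(line, line[1:]):
        if (prev >= mean) != (cur >= mean):
            segments += 1
    return entropy, mean, n / segments
```

Applying this to each of 100 consecutive scan lines gives three curves of the kind plotted in Figures 4-1 and 4-2.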

4.2.  RECOGNITION  RESULTS 

Table  4-1  presents  the  details  concerning  the  number  of  characters 
processed  and  recognized  in  each  of  the  Cyrillic  fonts.  The  percentages 
presented  are  estimates  of  the  error  rates  which  should  be  expected  from 
reading  the  appropriate  journals  with  the  same  techniques.  The  error  rate 
is  computed  as 


    e = \sum_{j=1}^{33} e_j f_j

where the summation is over all classes j of characters, e_j is the measured
fraction of errors on class j, and f_j is the frequency of class j expressed
as a fraction of unity and measured over a large population. The simpler
form

    e = (number of errors) / (number of sample characters)

might introduce bias due to sampling an atypical distribution of characters.
The frequencies of occurrence used in this study were obtained from a
previous study*.


* Engineering Analysis and Digital Simulation of the Optical Russian Print
Reader, Technical Documentary Report No. RADC-TDR-62-472, Sept. 3, 1962.
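
The difference between the frequency-weighted error rate and the simple ratio can be seen in a small Python sketch. The function names and the two-class figures in the usage note are hypothetical and are not data from this study.

```python
def weighted_error_rate(errors, samples, freq):
    """Sum over classes of (errors / samples) * population frequency."""
    return sum(f * errors.get(c, 0) / samples[c] for c, f in freq.items())


def simple_error_rate(errors, samples):
    """Total errors divided by total sample characters."""
    return sum(errors.values()) / sum(samples.values())
```

Suppose class "b" makes up 10% of running text but one third of the sample (50 of 150 characters) and accounts for all 5 errors. The simple form gives 5/150 = 3.3%, while the weighted form gives 0.1 x 5/50 = 1.0%, which is the rate to expect on ordinary text.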


Figure  4-1  Statistics  From  Text  Sine 


Table  4-2  describes  the  problem  pairs  encountered  in  each  of  the 
Cyrillic  fonts  and  their  relative  frequency  of  occurrence  within  that  font. 
Tables  4-3  and  4-4  provide  analogous  information  for  the  Latin  fonts. 

In the Cyrillic fonts, almost 85% of the errors (48/57) occurred among
the и, н, and п characters. The main difference among these three characters
is the presence or absence of a cross-piece between the two strong vertical
bars. (This is illustrated in more detail in Section 5.1. Another level
of classification logic to deal with this confusion group is discussed in
Section 5.2.) Noise (see Figure 4-3) and the varying width of this
cross-piece are the two main reasons for misclassification. Other errors
occurred when the stroke width of the crosspiece of "e" was thinner or thicker
than that of the mask, resulting in a classification as an "o" or a "c".

In the Latin fonts, the recognition rate was generally higher than in
the Cyrillic fonts, 99.74% vs. 99.43%. The main source of errors was
chopped serifs. The missing serifs, probably a result of overlap of characters
encountered when the characters were isolated and/or misprints, explain the
b→h, h→b, n→o, m→o, t→f, and l→f errors. Varying widths of the "e"
crosspiece explain the e→o and c→e errors.

An  experiment  to  determine  the  effect  on  the  recognition  system  of  a 
change  in  resolution  was  conducted  on  a subset  of  the  C6  font.  A decrease 
in  resolution  was  simulated  by  a 50%  linear  decimation  in  each  direction. 

This  was  accomplished  by  replacing  every  two  pixels  with  the  average  of 
those  two  pixels.  The  50%  decimation  also  accomplished  an  overall  storage 
reduction  of  4:1. 
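
Averaging pixel pairs along the rows and then along the columns is equivalent to replacing each 2 x 2 block by its mean, which can be sketched as below. The function name is hypothetical, and dropping a trailing odd row or column is an assumption; the report does not say how odd dimensions were treated.

```python
def decimate_2x2(image):
    """Halve the resolution in each direction by averaging 2 x 2 blocks.

    A trailing odd row or column is dropped (assumption).
    """
    rows = len(image) // 2 * 2
    cols = len(image[0]) // 2 * 2
    return [[(image[r][c] + image[r][c + 1]
              + image[r + 1][c] + image[r + 1][c + 1]) / 4
             for c in range(0, cols, 2)]
            for r in range(0, rows, 2)]
```

Each output pixel stands for four input pixels, which is the source of the 4:1 storage reduction.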

Twelve errors occurred in the 1267 samples, yielding a recognition
rate of 99.05%. Using the full-sized patterns, the recognition rate was
99.39%. The trade-off is a reduction in the recognition rate of 0.34% for a
4:1 reduction in storage requirements. Considering the amount of disk
space necessary for the isolated characters of font C6, 910K for
approximately 3100 characters, the 4:1 reduction seemed worthwhile indeed.

However,  in  a specialized  hardware  OCR  system,  all  the  isolated  characters 
of  a font  would  not  be  stored  at  once.  Rather,  they  would  be  processed  and 
then  the  storage  locations  released  for  other  characters.  The  main  hardware 
benefit  would  be  reduction  in  processing  time. 

Table 4-2 Errors in Cyrillic Data

4.3. ERROR VS. REJECTION TRADE-OFF

For those fonts which had non-trivial error rates (at least 1/2% and
at least 10 misclassified characters), an analysis was made of the
desirability of rejecting characters instead of making forced decisions. The
strategy selected to determine the confidence in a decision was to require a
particular minimum ratio of correlation before permitting a decision to be
made. For example, if the correlation (actually Euclidean distance) of a
character to mask "c" is 10,000 while its correlation to mask "o" is 10,010, a
forced decision would decide "c". However, a decision which required a
second-to-first ratio of 1.01 would reject the character.
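
The reject rule can be written as a short Python sketch. The function name `decide` and the handling of the degenerate cases (a single candidate, or a best distance of exactly zero) are assumptions; the numbers in the usage note are those of the example in the text.

```python
def decide(distances, min_ratio=1.0):
    """Return the best label, or None to reject.

    `distances` maps a label to its Euclidean-distance "correlation"
    (smaller is better). A decision is made only if
    second_best / best >= min_ratio.
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    best_label, best = ranked[0]
    if len(ranked) == 1 or best == 0:
        return best_label
    second = ranked[1][1]
    return best_label if second / best >= min_ratio else None
```

With distances of 10,000 to "c" and 10,010 to "o", `decide` returns "c" for a forced decision (ratio 1.00) and rejects the character when the required ratio is 1.01, since 10,010/10,000 = 1.001.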

Tables  4-5  and  4-6  present  sample  recognition-error-reject  rates  for 
some  of  the  fonts.  Figures  4-4  and  4-5  present  this  relationship  graphically. 

For both fonts, C6 and L8, the recognition rate declined as the minimum
ratio required for a decision increased. Although the error rate in both
instances dropped as the decision ratio increased, the overall recognition
rate also decreased. This is expected, as the increase in the number of
rejects illustrates. While the decrease in recognition rate as the ratio
increased from a forced (1.00) decision to a 1.03 decision is relatively
small, so is the effect on the number of errors. A decision ratio higher than
1.03 had a greater influence on the number of errors, but the recognition
rate for both fonts fell beneath 99%.

It is felt that to maintain a high recognition rate, the decision
ratio of second-best to best correlation should be equal to or less than
1.03. In this situation the use of a reject mechanism would offer little
improvement  in  total  performance.  Thus  it  was  felt  necessary  to  add  an 
additional  processing  pass  to  handle  the  confusion  groups,  such  as  [n  ,F  ]. 

4.4.  ANALYSIS  OF  3-PHASE  CLASSIFIER 

The three-phase scheme studied in this project can be basically described
as:

1.  preliminary  classification  with  centroid  registration 

2.  refined  classification  through  character  re-registration 

3.  additional  heuristics  to  improve  the  recognition  rate 

4.4.1.  Preliminary  Classification 

The correlation formula used in the first phase is

    c(X,M) = \frac{1}{N} \sum_{i=1}^{N} (x_i - m_i)^2        (1)

where X = <x_1 x_2 ... x_N> is the grey-level normalized input character and
M = <m_1 m_2 ... m_N> is the grey-level normalized mask of dimension N.

Let t_1 be the time required to compute c(X,M) by a particular processor.
Then obviously the total execution time to classify the input X in the
first phase is

    T_1 = \lceil |\mathcal{M}| / K \rceil \, t_1 + R_1



Table 4-5 Font C6 Error Trade-Off Analysis

Ratio of 1st to    No. of    No. of        Error    Recognition
2nd Correlation    Errors    Rejects       Rate     Rate
Forced (1.00)        13         0          0.58%    99.42%
1.03                 15        10          0.49%    99.16%
1.06                 10        28          0.32%    98.79%
1.09                  6     (illegible)    0.19%    98.42%
1.12                  4        78          0.13%    97.29%
1.15                  0       131          0%       95.59%
1.18                  0       179          0%       94.02%
1.21                  0       229          0%       92.51%

Table 4-6 Font L8 Error Trade-Off Analysis

Ratio of 1st to    No. of    No. of     Recognition
2nd Correlation    Errors    Rejects    Rate
Forced (1.00)        16         0       99.06%
1.03                 16         1       99.02%
1.06                 15         3       98.95%
1.09                 12         7       98.91%
1.12                  9        14       98.75%
1.15                  7        18       98.64%
1.18                  4        27       98.57%
1.21                  2        38       97.87%


Figure 4-4 Font C6 Error Trade-Off Analysis (recognition rate plotted against
the ratio of best-to-second-best correlation, 1.00 to 1.21)

where \lceil X \rceil is the smallest integer greater than or equal to X, K is
the number of processors that can be simultaneously operated, |\mathcal{M}| is
the number of masks to be correlated, and R_1 is the time required to load the
data, select the mask with minimal correlation, transfer the information to
the second phase, etc.

In our system, the size of the input character X is used so that X is
only correlated to those masks with comparable size. The expected execution
time, E{T_1}, is therefore

    E\{T_1\} = E\{\lceil |\mathcal{M}| / K \rceil\} \, t_1 + E\{R_1\}        (1)

Since |\mathcal{M}| depends on the input character size, we may write

    E\{|\mathcal{M}|\} = \sum_{i=1}^{L} P_i \, E\{|S_i|\}        (2)

where L is the number of characters in the alphabet, P_i is the probability of
occurrence of the character i in the text, and S_i is the subset of
characters with a similar size to that of character i.

Such simple prescreening greatly improves the efficiency of the first
phase. Our data show that |S_i| never exceeds 18 in the C8 font with
85% ≤ size(X)/size(M) ≤ 115%. Therefore, for K=1, the expected execution time
with size selection reduces to approximately one-third of that without (see
Table 4-7 for details).
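
The effect of prescreening on the first-phase time can be illustrated with the sketch below. The function names and all numbers in the usage note are hypothetical; the formulas are the expected mask count (sum over characters of frequency times size-subset count) and the time per character (ceiling of masks over processors, times the correlation time, plus overhead).

```python
import math


def expected_mask_count(freq, subset_size):
    """Expected number of masks correlated per character.

    `freq` maps a character to its frequency (fraction of unity);
    `subset_size` maps it to the number of masks of comparable size.
    """
    return sum(freq[c] * subset_size[c] for c in freq)


def phase_time(mask_count, processors, t_corr, overhead=0.0):
    """Execution time for one character: ceil(masks / K) * t + R."""
    return math.ceil(mask_count / processors) * t_corr + overhead
```

For a toy two-character alphabet with equal frequencies and size subsets of 10 and 20 masks, the expected count is 15 masks. Without prescreening, correlating against all 33 masks costs 33 correlation times on one processor and 9 on four.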


4.4.2.  Refined  Classification  Through  Character  Re-registration 


The same correlation formula (1) is used in the second phase. For r
re-registrations and K processors simultaneously operating, the expected
execution time will be

    E\{T_2\} = E\{\lceil r \, |\mathcal{M}'| / K \rceil\} \, t_1 + E\{R_2\}

where |\mathcal{M}'| is the number of masks required for re-registration.

Again, E\{|\mathcal{M}'|\} = \sum_{i=1}^{L} P_i' \, E\{|S_i'|\}, where P_i' is
the probability that character i may occur in the second phase and S_i' is the
set of characters that cannot be resolved from character i in the first
phase. Therefore, P_i' = P_i · P_i^{12}, where P_i^{12} is the probability of
character i passing to the second phase. P_i^{12} depends on the recognition
technique, obviously. Table 4-7 also summarizes an estimation of the
P_i^{12}'s for the C8 font by using our algorithm. Notice that, because of the
sample size, P_i, P_i', etc. in Table 4-7 are the frequency of occurrence of
character i, expressed as a fraction of unity, as opposed to the probability
per se.
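
The relation P_i' = P_i · P_i^{12} can be checked against columns A, C, and D of Table 4-7, which are given in percent. The helper name below is hypothetical; the three data rows are copied from the table.

```python
def second_phase_frequency(p_i, p_pass):
    """P_i' in percent, from P_i and P_i^{12}, both given in percent."""
    return p_i * p_pass / 100.0


# (A, C, D) for rows 1, 3 and 10 of Table 4-7.
rows = [(7.30, 32.29, 2.36), (4.61, 53.13, 2.45), (8.88, 94.85, 8.42)]
checks = [round(second_phase_frequency(a, c), 2) == d for a, c, d in rows]
```

All three products agree with column D to two decimal places, which supports reading D as the product of A and C.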

Table 4-7 Expected No. of Correlations and Frequency of Occurrence for
Cyrillic Data

  i       A       B       C       D       E       F       G       H   #Pairs
  1    7.30   17.43   32.29    2.36    4.23    3.23    0.08    3.00     3
  2    1.57    5.17    0.00    0.00      -     0.00    0.00      -      -
  3    4.61   17.33   53.13    2.45    4.94   14.71    0.36    2.00     1
  4    1.33   13.54   62.50    0.83    3.13    0.00    0.00      -      -
  5    2.75    5.94    6.45    0.18    2.00    0.00    0.00      -      -
  6    9.66   13.19   56.59    5.47    3.55    2.74    0.15    2.00     1
  8    0.81    6.00    0.00    0.00      -     0.00    0.00      -      -
  9    1.68   10.28   16.67    0.28    2.00    0.00    0.00      -      -
 10    8.88   11.41   94.85    8.42    4.94   95.35    8.03    3.03     3
 11    0.93    5.11    5.56    0.05    2.00    0.00    0.00      -      -
 12    3.54   13.31   56.86    2.01    5.90    0.00    0.00      -      -
 13    4.55   13.42   63.51    2.89    7.51    2.13    0.06    5.00    10
 14    3.25    9.13    9.26    0.30    5.60    0.00    0.00      -      -
 15    6.67   11.95   95.00    6.34    3.82   37.89    2.40    2.94     3
 16   10.47   17.62    6.92    0.72    4.91    0.00    0.00      -      -
 17    2.40   11.23   94.87    2.28    4.60   87.18    1.98    2.97     3
 18    5.37   13.35   83.53    4.49    3.24    0.00    0.00      -      -
 19    6.64   15.67   56.25    3.74    2.41    3.17    0.12    2.00     1
 20    2.53    4.88    0.00    0.00      -     0.00    0.00      -      -
 21    0.31    2.00    0.00    0.00      -     0.00    0.00      -      -
 22    1.03   13.31    0.00    0.00      -     0.00    0.00      -      -
 23    0.31    5.00   50.00    0.16    2.25    0.00    0.00      -      -
 24    1.33   12.92   36.00    0.48    6.89    0.00    0.00      -      -
 25    0.81    2.33   16.67    0.14    2.00    0.00    0.00      -      -
 26    0.41    6.67    0.00    0.00      -     0.00    0.00      -      -
 27    2.45    4.63   16.67    0.41    2.80    0.00    0.00      -      -
 28    0.37    9.50   70.00    0.26    4.86    0.00    0.00      -      -
 29    0.49    4.46    0.00    0.00      -     0.00    0.00      -      -
 30    1.07   11.81   36.11    0.39    3.69    0.00    0.00      -      -
 31    5.27    6.04    0.00    0.00      -     0.00    0.00      -      -
 32    1.19   17.81   80.95    0.96    5.47    5.88    0.06    2.00     1
 33    0.01   11.00    0.00    0.00      -     0.00    0.00      -      -

Column headings: A = P_i (%) [2]; B = E{|S_i|} [3]; C = P_i^{12} (%) [4];
D = P_i' (%); E = E{|S_i'|}; F = P_i^{23} (%) [4]; G = P_i'' (%) [4];
H = E{|S_i''|}; #Pairs [5].

1. Deleted
2. Normalized by the total percentage of small letters (98.82%) in the
   population
3. Size criterion: 0.85 ≤ (character size)/(mask size) ≤ 1.15
4. On 1506 C8 font data set
5. Number of pairs when weighted correlation is involved

A: Frequency of occurrence of the character in the population
B: Number of masks used for correlation in Phase I
C: Estimated percentage of characters passed to Phase II
D: Estimated frequency of occurrence of the characters passed to Phase II
E: Number of masks used for correlation in Phase II
F: Estimated percentage of characters passed to Phase III
G: Estimated frequency of occurrence of the characters passed to Phase III
H: Number of masks used for correlation in Phase III

Both P_i^{12} and E{|S_i'|} also depend on the criteria used to pass the
characters to the second phase. In our study, a real number t_12 is chosen
so that the mask M' will be passed to the second phase along with M if

    c(X,M') / c(X,M) \le t_{12}

where c(X,M) is the minimum correlation between X and the M's obtained in the
first phase. It is obvious then that as t_12 increases, P_i^{12} and
E{|S_i'|} increase, and therefore E{T_2} increases.

4.4.3. Additional Heuristics To Improve The Recognition Rate

Let t_3 be the execution time used for a particular heuristic in order
to improve the recognition rate in the third phase (or the fourth, the
fifth, etc.). By the same argument, we may write the expected execution
time as

    E\{T_3\} = E\{\lceil |\mathcal{M}''| / K \rceil\} \, t_3 + E\{R_3\}        (5)

E{|\mathcal{M}''|}, again, is a function of the criteria used to pass the
character to the third phase from the second phase.

Two heuristics were used in the study. One is a general weighted
correlation (Section 3.8) and the other is a feature-looking technique
specially designed to resolve the set [н, и, п] in Cyrillic alphabets.
Their time complexity will be analyzed in the following subsection.

4.4.4. Time Complexity In Sequential Simulation

In order to discuss the cost-performance trade-off of a practical OCR
system, one must analyze the time required for character recognition. Such
analysis is very difficult, if indeed possible, unless the system has been
determined in detail, especially when parallel processing is involved. The
following sequential simulation analysis can only give a first order
approximation.

In the first two phases, it requires N subtractions, N multiplications,
N-1 additions, and one division in order to obtain

    c(X,M) = \frac{1}{N} \sum_{i=1}^{N} (x_i - m_i)^2

Therefore,

    t_1 = (2N-1) \mu_A + (N+1) \mu_M

where \mu_A and \mu_M are the time required to perform one addition and one
multiplication, respectively.

The pairwise weighted correlation used in the third phase requires,
for each mask pair M, M':

a. N \mu_A to calculate w_i = |m_i - m_i'|

b. (2N-1) \mu_A + (2N+1) \mu_M to calculate \sum w_i (x_i - m_i)^2 / N

c. (2N-1) \mu_A + (2N+1) \mu_M to calculate \sum w_i (x_i - m_i')^2 / N

d. 1 comparison to choose the smaller of c(X,M) and c(X,M').

Therefore, t_3 = (5N-2) \mu_A + (4N+2) \mu_M is required for each pair of
masks. Notice, for 3 masks input to the pairwise weighted correlation
scheme, there are 3(3-1)/2 pairs to be compared.

When N is large (≈ 2500 in our experiments), the time for R_1 becomes
negligible and E{T_1} ≈ E{|\mathcal{M}|} t_1. Similarly,
E{T_2} ≈ r E{|\mathcal{M}'|} t_1 in the second phase and
E{T_3} ≈ E{|\mathcal{M}''|} t_3 in the third phase.

Estimated values of P_i, E{|S_i|}, etc., of the Cyrillic characters
of font C8 are given in Table 4-7 along with P_i given in [3]. The reliability
of these values has not been evaluated because of time limitation. From
these figures, one may calculate E{T_1}, E{T_2}, and E{T_3} in sequential
simulation as

    E\{T_1\} = \sum_{i=1}^{33} P_i \, E\{|S_i|\} \, t_1 = 12.43 \, t_1

    E\{T_2\} = r \sum_{i=1}^{33} P_i' \, E\{|S_i'|\} \, t_1 = 7.87 \, t_1  for r = 4

    E\{T_3\} = \sum_{i=1}^{33} P_i'' \, E\{|S_i''|\} \, t_3 = 0.39 \, t_3

If we further assume that 1 multiplication consumes twice as much time
as 1 addition, then t_3 ≈ 3.25 t_1 and E{T_3} = 1.27 t_1.
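
The operation counts can be checked numerically with the sketch below. The function names are hypothetical; the cost formulas, N = 2500, the 2:1 multiply-to-add ratio, and the factor 0.39 are taken from the text, and the single comparison in the third phase is neglected.

```python
def t1_cost(n, mu_a=1.0, mu_m=2.0):
    """Time for one correlation c(X,M): (2N-1) adds and (N+1) multiplies."""
    return (2 * n - 1) * mu_a + (n + 1) * mu_m


def t3_cost(n, mu_a=1.0, mu_m=2.0):
    """Time for one pairwise weighted comparison:
    (5N-2) adds and (4N+2) multiplies."""
    return (5 * n - 2) * mu_a + (4 * n + 2) * mu_m


N = 2500
ratio = t3_cost(N) / t1_cost(N)        # t_3 expressed in units of t_1
expected_t3 = 0.39 * ratio             # E{T_3} in units of t_1
total = 12.43 + 7.87 + expected_t3     # all three phases, in units of t_1
```

With these assumptions the ratio comes out close to 3.25 and the third phase adds about 1.27 t_1 to the 20.30 t_1 of the first two phases.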


According to Table 4-8, the recognition rate for font C8 after phase 1
is 94.82%, after phase 2 is 98.76%, and after phase 3 is 98.89%. Therefore,
it seems the performance of the weighted correlation in phase 3 is not worth
the time cost for the sequential processing.


The feature-looking technique is used when the input character has
been determined as belonging to [н, и, п]. It requires

Table 4-8 Number of Characters Reaching each Stage of Classification

          No. of Chars.    No. in 2nd Pass    No. in 3rd Pass
Font      in 1st Pass        #        %         #        %
C1            1654          717      44%       166      10%
C2            2038          596      29%        *        *
C4            1972          144       7%        *        *
C6            3085         1802      58%        *        *
C8            1506          706      46%       205      14%
L1            1599          608      38%        48       3%
L2            1128          420      37%        76       7%
L4            1491          608      41%        54       4%
L5            1396          600      43%        60       4%
L8            1742          877      51%       179      11%
Total**      10516         4536      43%       788       7%

*  Information unavailable at processing time
** Total only includes the 7 fonts for which all information is available

a. (1/2) N \mu_A to get the vertically projected values in the central
   part of the input.

b. \sqrt{N} comparisons to find the location of the peak if there is
   any (here we simply assume that the dimension of the input is
   \sqrt{N} by \sqrt{N}).

c. 2 comparisons for decision making.

Assuming 1 comparison requires 1 add-time, we have

    t_4 = (N/2 + \sqrt{N} + 2) \mu_A

With 1 multiplication taking twice as long as 1 addition, t_1 ≈ 4N \mu_A, so
that t_4 ≈ t_1 / 8. Since P_и + P_н + P_п = 14.08%, the estimated execution
time is

    E\{T_4\} = (1/8) × 0.1408 \, t_1 = 0.018 \, t_1

According to our study, 84% of the error is due to the indistinctness among
и, н, and п. For the C8 font, this means the error rate can be reduced to
.18%. That is, by spending 0.09% more execution time (0.018/(12.43 +
7.87)), one may get an 0.95% increase in recognition rate. It is therefore
recommendable for sequential processing.
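
The figures quoted for the feature-looking technique can be re-derived with a few lines of Python. This is a check under stated assumptions, not the report's own calculation: the N/2 term for the projection step is an assumed reading of the damaged source, the 2:1 multiply-to-add ratio is the one used earlier, and the function name is hypothetical.

```python
import math


def feature_cost(n, mu_a=1.0):
    """Adds and comparisons for the feature test:
    N/2 (assumed) + sqrt(N) + 2, each costing one add-time."""
    return (n / 2 + math.sqrt(n) + 2) * mu_a


N = 2500
t1 = (2 * N - 1) * 1.0 + (N + 1) * 2.0       # one correlation, mu_M = 2 mu_A
relative_cost = feature_cost(N) / t1          # roughly 1/8
expected_time = relative_cost * 0.1408        # in units of t_1
overhead_pct = 100 * 0.018 / (12.43 + 7.87)   # extra execution time, percent
residual_error = (100 - 98.89) * (1 - 0.84)   # C8 error rate after the fix
```

Under these assumptions the expected time is about 0.018 t_1, the overhead about 0.09%, and the residual C8 error rate about 0.18%, in line with the values quoted above.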

4.4.5.


The  Latin  font  results  are  analogous  to  the  Cyrillic  results.  The 
second  phase  resolved  65  of  the  88  total  first  phase  errors.  The  third 
phase  offered  no  improvement  over  the  second  phase. 


Table 4-8 provides information about the number and percentage of
characters that reached the second and third phases. The criterion for
determining if a second and third phase are needed is the following: the
ratio of the correlation of a given character-mask pair to the lowest
correlation of all the character-mask pairs is computed for all
character-mask pairs. All the masks which have a ratio less than 2.5 are
subjected to the second phase of correlation. The correlations are again
computed, as are the correlation/lowest-correlation ratios. All the masks
which have a ratio of less than 1.4 are passed on to the third phase of
correlation, where the weighted difference masks are employed. The sample is
then classified as the character which corresponds to the lowest
character-mask correlation. If, at the end of any phase of correlation, only
one character-mask pair meets the ratio criterion, then a classification is
made and the process is terminated. An example of this classification method
is illustrated below; the "c" is correctly labelled a "c".
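
Independently of the report's own illustration, the ratio criterion can be sketched in Python as follows. The function name `survivors` and the distances in the usage note are hypothetical; only the thresholds 2.5 and 1.4 come from the text.

```python
def survivors(distances, max_ratio):
    """Labels whose distance is less than max_ratio times the lowest one.

    `distances` maps a mask label to its correlation (a distance, so
    smaller is better). A lowest distance of zero keeps only exact ties.
    """
    best = min(distances.values())
    if best == 0:
        return [label for label, d in distances.items() if d == 0]
    return [label for label, d in distances.items() if d / best < max_ratio]


phase1 = {"c": 100.0, "e": 130.0, "o": 180.0, "m": 400.0}
to_phase2 = survivors(phase1, 2.5)            # "m" is dropped
phase2 = {"c": 60.0, "e": 95.0, "o": 80.0}    # after re-registration
to_phase3 = survivors(phase2, 1.4)            # "e" is dropped
```

If only one label survives a phase, it is the decision and processing stops; here two labels remain after the second phase, so the sample would go on to the weighted comparison.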


Table 4-9 presents the recognition after each phase of classification.
Using only center of mass registration resulted in an average Cyrillic
recognition rate of 97.27% with 2 fonts, C1 (92.62%) and C8 (94.82%), doing
poorly. After the additional ±1 pixel shifting, the average rate rose to
99.43% with the 2 troublesome fonts, C1 and C8, significantly improved.
Because the mask weighting algorithm was not fully developed at the time
that fonts C2, C4, and C6 were processed, the third phase results are not
available for those fonts. However, of the other two fonts, the third
phase resolved only 2 of the 19 C8 errors and none of the 13 C1 errors.


Table 4-8 reveals that, on the average, 43% of the samples of a font
are passed to the second phase and 7% of all the samples were passed
to the third phase. The recognition rates rose significantly in the C1, C8
and L8 fonts with the use of the second phase. The overall number of Cyrillic
errors dropped from 280 to 59 with the use of the second phase. Using the
time analysis results obtained, this can be expressed as a 2.16% (99.43% -
97.27%) decrease in errors for a 63% (7.87 t_1/12.43 t_1) increase in computer
time. The decrease in errors using the weighted correlation in the third
phase does not justify the increase in computer time.


Table 4-9 Recognition Results for each Stage of Classification

Columns: Font; No. of Characters; Errors After Shifting, Recognition Rate;
Errors After Mask Weighting, Recognition Rate.
No. of Characters listed: 1654, 2038, 1972, 3085, 1506.
Summary rows: Cyrillic Subtotal; Subtotal; Overall.

* Information unavailable at processing time


SECTION  5 


IMPROVEMENTS TO THE RECOGNITION SCHEME


Figure 5-1 illustrates a problem that occurred in the character
isolation section of the OCR system. To allow for the possibility of overlap of
neighboring  characters,  the  extraction  program  DOCR  searches  every  90 
elements  for  the  minimum  number  of  non-zero  pixels  in  the  line  of  text 
vertical  histogram.  The  "t"  and  "h"  in  line  2 of  Figure  5-1  overlap  while 
there  is  a zero  pixel  count  after  the  "h".  Since  the  combined  width  of  the 
"t"  and  "h"  characters  is  less  than  the  maximum  allowed  width,  these  two 
were  isolated  as  one  character,  "th". 
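
The histogram search described above can be sketched as follows. This is a minimal sketch, not the DOCR program itself: the function names, the list-of-lists image representation, and the rule of cutting at the first minimum found inside each window are assumptions made for illustration.

```python
from typing import List, Tuple


def column_histogram(line_image: List[List[int]]) -> List[int]:
    """Count the non-zero pixels in each column of a line-of-text image."""
    if not line_image:
        return []
    width = len(line_image[0])
    return [sum(1 for row in line_image if row[c] != 0) for c in range(width)]


def isolate(hist: List[int], max_width: int = 90) -> List[Tuple[int, int]]:
    """Return (start, end) column spans, end exclusive, one per isolated character."""
    spans = []
    n = len(hist)
    pos = 0
    while pos < n:
        # Skip blank columns in front of the next character.
        while pos < n and hist[pos] == 0:
            pos += 1
        if pos == n:
            break
        # Look at most max_width columns ahead; the end of the line counts as blank.
        last = min(pos + max_width, n)
        cut = min(range(pos + 1, last + 1),
                  key=lambda c: hist[c] if c < n else 0)
        spans.append((pos, cut))
        pos = cut
    return spans
```

With a window wide enough to reach the blank column after two touching characters, the pair comes out as a single span, which is the "th" behavior described for Figure 5-1.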

Another isolation problem is illustrated in Figure 5-2. This takes
place when the non-zero pixel count minimum occurs before the end of the
character encountered. This can result from a character width greater
than 90 pixels (for example, some Cyrillic capitals) or from a
printing flaw that leaves a break in the character itself.

One  possible  solution  to  these  isolation  difficulties  is  to  have  the 
extraction  and  classification  subsystems  interact  with  one  another.  When 
the  classifier  rejects  a character  or  consecutive  characters,  these  charac- 
ters should  be  subjected  to  another  level  of  extraction.  The  width  of  the 
"character"  after  the  first  isolation  would  reveal  the  existence  of  two  or 
more  overlapping  characters.  These  overlapping  characters  can  be  isolated 
by  searching  for  a minimum  within  the  extracted  "character".  This  would, 
however,  require  storage  of  the  raw  scanned  data  for  all  characters  not 
completely processed, or the ability to rescan from the printed page.

When successive characters are rejected, this could also be an indication
of "over-isolated" characters. The consecutive rejects could be
concatenated together to form a new "character" and be submitted for
reclassification.

As noted in Section 4, almost 85% (48/…) of the errors in the
Cyrillic fonts occurred among the И, Н, and П characters. The following
discussion will provide insight into the cause of classification errors and
an alternative method of classification for these three characters.

Figures 5-3 through 5-5 present these three masks for the C2 font.
For illustrative purposes, the intensity values for each pixel have been
quantized into two levels, symbolized by a blank or "0". Figure 5-6 is a
sample from font C2 also quantized into two levels, but symbolized by
a blank or "x". Figure 5-7 shows this sample overlayed with the Н mask.
Figure 5-8 shows the difference between the mask and the character (quantized
into several levels). Figures 5-9 and 5-10 show the Н character
overlayed with the П mask and the difference between the two, respectively.

Figure 5-2 Example of Split Character
Figure 5-7 Н Mask, Н Sample Overlay
Figure 5-8 Н Mask, Н Sample Overlay Difference
Figure 5-9 П Mask, Н Sample Overlay
Figure 5-10 П Mask, Н Sample Overlay

Comparison of Figures 5-8 and 5-10 shows that the least amount of overlap
occurs in Figure 5-8, the Н character and the Н mask, although the difference
in the amount of overlap between the two comparisons is small. This
is reflected in the difference between the Н/Н correlation of 5968 and the
Н/П correlation of 7012 (the lower value indicates the better match). This
character was correctly classified as Н.

Figures 5-11 through 5-15 present another Н sample, as above, which
was incorrectly classified as a П. The Н sample is slightly thinner in
the vertical strokes than the previous sample and slightly shorter in
height than the Н mask. There also exists a chopped or mutilated serif.
The decision between Н and П is very close with an Н/П correlation of 9484
and an Н/Н correlation of 9604, a difference of only 1.3%.

The  regeneration  of  the  H mask  by  selection  of  a higher  threshold  to 
reduce  the  "spread"  of  the  mask  could  possibly  create  a correct  decision  in 
the  second  example  cited  above.  This  technique  of  higher  thresholding  of 
certain  masks  generally  decreased  the  error  rate  to  an  acceptable  level. 
However,  there  are  two  problems  associated  with  this: 

1.  The  iterative  process  of  selecting  the  best  combination  of 
differently  thresholded  masks  is  operator-dependent  and  requires 
many  runs  of  data  to  determine  what  is  the  best  combination,  and 

2.  Adjusting  for  instances  of  a specific  error  type  often  creates 
other  errors  when  applying  a given  combination  of  masks  to  more 
data. 

Since 85% of the Cyrillic errors occurred in the И, Н, П subset, an
alternate level of logic was introduced for these three characters to
reduce the overall error rate. After a sample has been identified by the
classification system as a member of this subset, it is subjected to an
area-sensitive logic that examines the main area of difference between
these characters. A vertical or column histogram is made and the area
between the two vertical segments of the character extracted. See Figures
5-16 through 5-18. A horizontal histogram is then performed over the
extracted area. This area is enclosed by the dotted line in Figures 5-16
through 5-18. The resulting distribution readily identifies the character
as an И, Н, or П.

The C8 font was selected as the test set for this algorithm because
all of the 17 errors in this font were misclassified members of this
subset. The error rate was reduced to 0%.

As  a result  of  this  investigation,  it  is  suggested  that  this  alternate 
logic  for  the  three  character  subset  be  incorporated  into  the  basic  corre- 
lation system.  This  also  suggests  that  other  confusion  groups  may  be 
resolved  with  special  logic. 

Figure 5-16 Row and Column Histograms of И Character

The price of using special logic is, at least initially, the requirement
for manual design thereof. Thus the simple mask design scheme originally
envisioned for the correlation approach would no longer be adequate
to achieve the highest level of performance possible. The inability to
automatically train on-line has to be evaluated in the context of the
intended application(s). It is expected that if the relative frequency of
new  font  design  is  low  then  on-line  design  is  not  a necessity.  In  the 
foreign  journal  translation  case,  after  the  initial  designs  are  done,  the 
new  font  frequency  would  be  determined  by  the  publishers  of  new  and/or  old 
journals  and  the  rate  these  are  acquired  by  the  Government.  The  publishers 
would  have  to  automate  their  font  design  for  printing  to  make  it  worthwhile 
to  automate  our  font  design  for  reading.  In  the  meantime,  manual  design  of 
new  fonts  for  OCR  classifier  logic  could  be  provided  quickly  and  efficiently 
by  skilled  government  and/or  contractor  staff  who  are  familiar  with  the 
system  specifications,  and  who  understand  OCR  design  principles. 


SECTION  6 


HARDWARE CONSIDERATIONS

6.1.  BACKGROUND 

The  concept  of  a multi-font  machine  is  not  new.  Several  approaches 
have  been  suggested  and  implemented.  The  OCR  unit  built  for  Social  Security 
provides separate firmware recognition logic for all expected fonts [4]. A
more  efficient,  but  probably  less  accurate,  approach  would  use  a small 
number  of  sophisticated  recognition  logics  which  can  generalize  over  several 
fonts  each.  Another  approach  allows  the  OCR  to  be  reconfigured  at  run-time 
by  reading  into  a control  memory  the  recognition  logic  parameters  appropriate 
for  the  current  run.  An  adaptive  approach  which  requires  each  character  to 
be  identified  correctly  by  a human  trainer  has  been  promoted  as  providing  a 
semi-automatic means to extend the basic repertoire of an OCR [5].

The  firmware  approach  is  expensive  and  still  is  limited  to  a fixed 
repertoire  of  fonts.  The  lead-time  to  add  a new  font  is  relatively  long. 

It  has  been  proposed  to  use  hand-printing  recognition  logic  to  provide  the 
generalization  capability  over  many  machine  print  fonts  because  its  recognition 
logic  must  perform  well  for  many  styles  of  printing  and  generally  poorer 
character  print  quality.  However,  this  added  sophistication  was  not  needed 
for  installations  reading  only  a few  different  OCR  fonts  and  therefore  not 
commercially  developed.  Furthermore,  this  logic  would  not  take  advantage 
of  the  a priori  knowledge  of  the  font  size  and  shape  specifications. 

Machine print differs from hand-print basically in its geometric
repeatability. The machine reconfiguration-at-run-time approach economizes on
hardware  costs,  by  requiring  a smaller  number  of  more  standard  logic  units, 
each  having  memory  loaded  under  software  job  control.  This  technique  is 
also  limited  to  those  fonts  available  at  the  moment,  but  new  fonts  can  be 
added  using  off-line  software  recognition  logic  design  routines.  If  quicker 
adaptation  to  new  fonts  is  required,  on-line  design  can  be  obtained  and 
implemented  in  software  using  adaptive  training  procedures.  However,  the 
throughput  (processing  speed)  performance  resulting  is  slow,  because  of 
software  logic  implementation,  and  the  accuracy  is  highly  sensitive  to 
errors  introduced  in  training  by  insufficiently  skilled  operators.  In 
order  to  completely  resolve  certain  character-pair  confusions,  it  is 
probably  necessary  to  design  special  logic  that  can  be  applied  to  just 
those  cases.  This  suggests  the  need  for  off-line  design  by  skilled  tech- 
nical personnel.  The  design  techniques  would  use  a simulation  of  the  basic 
hardware  classifier.  The  resulting  data  would  then  be  used  by  the  on-line 
hardware  to  reconfigure  its  operation  appropriately. 

Correlation search for the best match is performed on digital images
sampled in rows and columns by shifting the unknown character image left
and right, and/or up and down, about the centroid of each mask, and choosing
the lowest variance (highest correlation). The present variance function
(implemented in the DIMES system on the HIS 635 and in PL/I on the HIS
6180) also normalizes the effects of different average grey-levels and
contrasts (intraimage variances). However, there are no explicit normalizations
for rotation, skew, and non-isotropic scale changes. The effects of
these variables were previously investigated.
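
The normalize-then-shift search described above can be sketched as follows. This is a sketch under stated assumptions, not the DIMES or PL/I implementation: the function names are hypothetical, the masks are assumed to be already normalized and the same size as the sample, the squared Euclidean distance stands in for the report's variance function, and pixels shifted in from outside the array are filled with 0.0 (the normalized mean).

```python
import math
from typing import Dict, List, Tuple

Image = List[List[float]]


def normalize(img: Image) -> Image:
    """Rescale grey values to zero mean and unit standard deviation."""
    values = [v for row in img for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("a flat image has no contrast to normalize")
    return [[(v - mean) / std for v in row] for row in img]


def shift(img: Image, dy: int, dx: int) -> Image:
    """Move the image dy rows down and dx columns right, padding with 0.0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if 0 <= y - dy < h and 0 <= x - dx < w:
                out[y][x] = img[y - dy][x - dx]
    return out


def squared_distance(a: Image, b: Image) -> float:
    """Squared Euclidean distance between two equally sized arrays."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def best_match(sample: Image, masks: Dict[str, Image],
               max_shift: int = 1) -> Tuple[str, float]:
    """Try every shift of the sample against every mask; keep the lowest distance."""
    z = normalize(sample)
    best_label, best_dist = "", math.inf
    for label, mask in masks.items():
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                d = squared_distance(shift(z, dy, dx), mask)
                if d < best_dist:
                    best_label, best_dist = label, d
    return best_label, best_dist
```

Because both arrays have zero mean and unit standard deviation, a sample that differs from a mask only in average grey level or contrast yields the same distance, which is the normalization property the paragraph describes.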

Successful  character  recognition  depends  not  only  on  powerful  classifi- 
cation logic,  but  also  on  character  detection,  location,  isolation,  and 
registration  techniques.  Once  the  next  character  to  be  read  is  located  and 
isolated  from  its  neighbors  and  from  extraneous  markings,  the  recognition 
routine  must  accurately  register  the  character  and  each  of  the  stored 
masks.  Detection,  location,  and  isolation  are  usually  performed  by  logic 
separate  from  the  classification  logic,  especially  when  automatic  tracking 
of  lines  of  text  is  required.  However,  correlation  values  can  also  be  used 
to  obtain  a measure  of  registration  to  obtain  the  best  fit  when  nearly 
registered.  Thus,  it  is  possible  to  integrate  the  classification  logic 
with  a control  algorithm  for  character  registration.  As  described  above  in 
Section  3,  such  integration  was  included  in  the  tested  design. 

Nearly  all  OCR's  depend  on  constant  pitch  and  fixed  formatting  (as 
opposed  to  free  form)  of  characters  to  be  read.  In  addition,  it  is  usually 
known  what  the  font  is  at  each  field  (space  allocated  for  a single  logical 
data item). In the present application, typeset characters, variable
spacing, freeform formatting, and unpredictable font type and size must be
accommodated. An additional complication which occurs, particularly in
technical  journals,  is  the  presence  of  graphics.  Line  drawings  and 
illustrations  can  occur  at  unpredictable  locations  on  a page  of  text  as 
well  as  cover  entire  pages.  Manual  cut-paste-and-scan  operations  are  the 
common  way  of  obtaining  these  data  at  present.  The  control  algorithm  for 
detecting  and  locating  the  text,  data,  and  graphics  must  perform  well  in 
spite  of  all  of  these  peculiarities. 

The work, under contract F30602-75-C-0269, was conducted in the context
of  an  eventual  hardware  implementation.  That  is,  the  knowledge  of  current 
and  announced  computational  components  and  techniques  has  strongly  influenced 
the  direction  and  form  of  the  design  evaluation  software. 

6.2.  THE PROPOSED TRIAL HARDWARE FOR CLASSIFIER LOGIC

Hardware construction for the proposed OCR system could be accomplished
in three phases. The first phase would comprise the development of a
limited character processor subsystem which implements the classifier logic
and interfaces. Phase 2 would put this subsystem on line with the specified
host computer. A third phase could expand the system to meet the total
performance specifications.

A PDP-11/45, such as one of several now located at PAR, could simulate
the  host  computer  for  Phase  1.  The  11/45  itself  is  incapable  of  supporting 
the  desired  processing  speeds  necessary  for  the  functions  of  inputting 
scanned  text,  and  detecting,  locating,  and  isolating  characters  with  the 
desired  resolution;  but  that  is  not  deemed  necessary  for  the  Phase  1 effort. 
Its  primary  function  will  be  simply  to  transmit  to  the  character  processor 
subsystem  previously  digitized  and  isolated  characters  for  the  purpose  of 
recognition,  to  receive  back  the  decision  or  rejection  information,  and  to 
tabulate  the  performance  statistics. 

The  character  processor  subsystem  proposed  will  consist  of  several 
microprocessors  organized  as  shown  in  Figure  6-1.  It  consists  of  one 
microprocessor  which  serves  as  character  processor  input  controller;  this 
microprocessor  must  distribute  character  information  to  the  other  micro- 
processors. A second  microprocessor  serves  as  the  character  processor 
output  controller  and  as  arbiter  of  their  decisions. 

The  remaining  microprocessors  are  character  processors.  They  each 
have  access  to  sufficient  memory  to  store  the  masks  for  the  characters  in 
the  font(s)  of  interest.  The  function  of  the  proposed  character  processor 
is  to  determine  a measure  of  the  correlation  between  the  mask  in  its  memory 
and  the  particular  character  being  processed.  A norm  monotonically  related 
to  Euclidean  distance  will  be  used  to  determine  correlation.  Also,  special 
logic  will  be  used  to  discriminate  between  characters  that  tend  to  look 
alike  except  for  differences  that  are  small  in  area. 

The  proposed  character  processor  controller  will  poll  the  character 
processors  to  determine  when  a processor  is  ready  to  receive  a new  character 
or  transmit  the  results  of  the  previous  character  identification.  The 
input  controller  must  assign  a sequence  number  to  each  character  and  direct 
the  loading  of  this  character  into  the  memory  of  a "ready"  character 
processor. 

The  output  controller  must  receive  the  identified  characters  and 
output  them  in  proper  sequence.  Unidentified  characters  are  directed  to 
the  error  correction  routines. 
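
The sequence numbering and in-order output described above can be sketched as follows. This is an illustrative sketch only: the class and method names are hypothetical, and a rejected character is modeled as a result whose label is None.

```python
from typing import Dict, List, Optional, Tuple


class InputController:
    """Tags each incoming character with a sequence number."""

    def __init__(self) -> None:
        self._next_seq = 0

    def tag(self, character: object) -> Tuple[int, object]:
        seq = self._next_seq
        self._next_seq += 1
        return seq, character


class OutputController:
    """Releases results in sequence order, whatever order they arrive in."""

    def __init__(self) -> None:
        self._pending: Dict[int, Optional[str]] = {}
        self._next_seq = 0
        self.rejects: List[int] = []  # sequence numbers handed to error correction

    def receive(self, seq: int, label: Optional[str]) -> List[Tuple[int, str]]:
        """Store one result and return the identified characters now ready for output."""
        self._pending[seq] = label
        released: List[Tuple[int, str]] = []
        while self._next_seq in self._pending:
            ready = self._pending.pop(self._next_seq)
            if ready is None:
                self.rejects.append(self._next_seq)
            else:
                released.append((self._next_seq, ready))
            self._next_seq += 1
        return released
```

Buffering by sequence number lets a character processor that finishes early hand back its result without disturbing the output order.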

Although  the  ultimate  design  of  the  proposed  hardware  will  include 
sufficient  capacity  to  process  each  character  in  the  fonts  of  interest, 
Phases  1 and  2 provide  for  the  building  of  a limited  subsystem,  sufficient 
to  determine  hardware  feasibility. 

By  using  only  two  to  four  character  processors  in  the  limited  subsystem 
the  processing  speed  will  be  limited,  if  all  masks  are  used.  However,  the 
ability  of  the  system  to  use  different  sets  of  masks  as  its  alphabets 
enables a wide range of system tests to be completed. The essential
considerations in the hardware design will be answered with this amount of
circuitry.

The proposed hardware can be built with attention given to appropriate
parameters of minimum noise interference and cost with maximum chip density,
transmission rates, and processing speeds.

A cross assembler would be purchased for the purpose of translating
the assembler language programs for the microprocessors into their machine
language.

It is felt that the experience gained on the design development, test,
and evaluation effort reported herein, combined with general expertise
obtained by direct responsibility for several OCR hardware projects, in all
phases thereof, is necessary for efficient execution of the suggested
companion classifier hardware effort, described below.

6.3.  TECHNICAL APPROACH

The hardware-software configuration will automatically classify (when
possible)  a subset  of  all  characters  on  digitized  images  of  typeset  Latin 
(English)  or  Cyrillic  (Russian)  text,  previously  scanned  from  technical 
journals  and  other  sources.  It  will  also  provide  a facility  for  different 
subsets  of  those  characters  though  its  initially  limited  scope  renders 
simultaneous  capability  for  all  characters  impossible. 

The software system on the Phase 1 host computer will assume that the
automatic detection, location, and isolation of individual characters will
have been performed. These isolated characters will be presented to the
classification logic to be either recognized or rejected.

The character processor hardware will consist of sufficient hardware
to realize a character recognition capability. In addition to such miscellaneous
equipment as power supplies and DMA hardware for communication
with the host computer (a PDP-11/45), the hardware produced will include
the character processor controller and the character recognition logic for
two to four character processors. The character processor subsystem will
be constructed with [illegible] circuits and/or sockets to permit wire-wrapped
connection of the integrated circuits. Backplane wiring will be provided.
Intel 3000 series microprocessors will probably be used for the character
processor and character processor controller; however, final selection of
components will be based on the cost, capability, and availability of
devices current at the time the work is undertaken.

Programs to implement the required functions on the microcomputer will
be  written  in  the  appropriate  assembler  language.  To  translate  the  assembler 
language  to  the  necessary  micro  machine  language,  a cross  assembler  which 
runs  on  the  11/45  will  be  purchased  and  modified  as  necessary. 

The  software  system  will  provide  a convenient  means  to  store  the 
classification results and the images of characters rejected and misclassified.

6.4.  PROCESS TIME ESTIMATE

An  estimate  has  been  made  of  the  processing  time  required  for  character 
recognition in the character processors. The assumed processor is a bipolar
microprocessor with a total system cycle time of 150 ns utilizing 16-bit
words. The recognition process consists of three decision levels where
correlations  between  the  unknown  data  character  and  stored  character  masks 
are  evaluated.  As  characters  are  identified  they  exit  from  the  recognition 
process  and  therefore  all  characters  do  not  reach  Phase  2 or  Phase  3 of  the 
process.  The  estimated  processing  time  is  the  average  time  per  character 
taking  into  account  the  statistical  frequency  of  occurrence  for  the  characters 
as  well  as  the  number  of  correlations  and  process  phases  required  to  identify 
each  character. 

The  timing  estimates  were  made  using  a sample  character  set  consisting 
of  32  lower  case  Cyrillic  (Russian)  alphabetic  characters.  The  total 
average  processing  time  per  character  is  144  milliseconds.  This  represents 
a straight  line  approach  without  the  application  of  timesaving  techniques. 

To  meet  the  specified  processing  rate  of  no  more  than  10  seconds  per 
page  (2000  characters  per  page  avg.),  some  time  reduction  techniques  will 
be  employed.  Among  the  more  significant  techniques  to  be  evaluated  is  the 
averaging  of  adjacent  horizontal  and  vertical  pixels  for  a possible  data 
reduction  factor  of  4,  and  the  use  of  parallel  character  processors  with  a 
time reduction factor proportional to the number of processors used.

Assuming  full  utilization  of  the  two  techniques  mentioned  above,  it 
would  require  eight  character  processors  to  meet  the  requirement  of  10 
seconds per page or 5 ms per character for the 32 character test font. If
the  number  of  different  characters  processed  at  one  time  is  increased,  the 
average  time  per  character  will  increase.  The  exact  amount  of  increase 
relative  to  a particular  expansion  of  the  character  set  has  not  been 
determined  at  this  time. 
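
The arithmetic behind the eight-processor figure can be checked with a short calculation. The constants are the report's own figures; the function and constant names are illustrative, and the sketch assumes, as the text does, that both reduction techniques are fully effective.

```python
import math

BASE_MS_PER_CHAR = 144.0   # straight-line estimate for the 32 character test font
PIXEL_REDUCTION = 4        # averaging adjacent horizontal and vertical pixels
PAGE_SECONDS = 10.0        # specified maximum processing time per page
CHARS_PER_PAGE = 2000      # average number of characters per page


def budget_ms_per_char() -> float:
    """Time available per character to meet the page rate."""
    return PAGE_SECONDS * 1000.0 / CHARS_PER_PAGE


def processors_needed(base_ms: float, reduction: float, budget_ms: float) -> int:
    """Smallest number of parallel processors that meets the per-character budget."""
    return math.ceil(base_ms / reduction / budget_ms)


budget = budget_ms_per_char()                                          # 5 ms per character
count = processors_needed(BASE_MS_PER_CHAR, PIXEL_REDUCTION, budget)   # 36 ms / 5 ms, rounded up
per_char = BASE_MS_PER_CHAR / PIXEL_REDUCTION / count                  # time per character achieved
```

Data reduction alone brings the estimate to 36 ms per character, so seven processors would still exceed the 5 ms budget and eight are required.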

SECTION 7


SUMMARY  AND  FUTURE  WORK 


The  project  effort  resulted  in  an  evaluation  of  a correlation  technique 
for implementation in a hardware character recognition processor. The
source  of  the  five  Cyrillic  and  the  five  Latin  fonts  processed  experimentally 
using  design  simulation  software  was  several  Russian  technical  journals.  A 
total  of  17,611  characters  (7356  Latin  and  10255  Cyrillic)  were  processed. 

The  correlations  were  performed  on  the  RADC  HIS  6180  computer  facility 
under  the  MULTICS  time-sharing  environment. 

It  was  shown  that  presorting  the  samples  by  size  will  eliminate  the 
inclusion  problem;  that  is,  some  characters  are  included  in  others  as  parts 
and  obtain  high  correlation  with  those  parts.  Presorting  also  reduced 
overall  computation  time.  Registration  of  sample  and  mask  by  center  of 
mass  resulted  in  an  overall  recognition  rate  of  97.91%.  When  the  method  of 
shifting masks ±1 pixel in each direction was added, this rate rose to
99.53%.  The  weighted  mask  technique  increased  this  rate  to  99.56%. 

A time-analysis  study  showed  the  1st  and  2nd  stages  of  the  classifier 
to  be  cost-effective.  The  3rd  stage  of  weighted  difference  masks  takes  a 
large  proportion  of  computer  time  to  resolve  a small  number  of  errors.  In 
view  of  this,  a new  algorithm  was  devised  for  implementation  in  the  most 
error-prone confusion group of Cyrillic characters, И, Н and П.
This procedure was tested in the C8 font and increased the recognition
rate  to  100%,  using  1506  characters. 

An  error  vs.  rejection  trade-off  analysis  showed  that  to  reject  any 
proportion  of  erroneously  classified  characters,  a greater  proportion  of 
correctly  classified  characters  must  be  rejected. 

A preliminary  study  of  entropy,  average  intensity,  and  average  length 
of black/white segments revealed these statistics to be reasonably consistent
over any region of either text or non-text. Whether these statistics are
individually  sufficient  to  distinguish  various  text  from  non-text  requires 
further  investigation. 

To  increase  the  effectiveness  of  the  character  isolation  function, 
interaction  between  the  extraction  and  classification  procedures  is  needed. 
That  is,  the  classifier  should  relay  its  results  to  the  extraction  system 
for  use  in  re-isolation  of  rejected  characters. 

A hardware throughput study determined that design limitations prohibit
the use of an average of 2000 pixels per character. That is, the use of a
40 micron "spot" results in too much image data to economically process 200
characters per second. A simulation of lower resolution by reducing patterns
50% linearly on a 1200 character subset increased the error rate for the
sample from 0.6% to approximately 1%. It will be necessary to study specific
application requirements to determine if the increase in error rate is
tolerable. If smaller error rates are required, more processor
units and/or some reduction in speed will be needed.

Presorting the samples by size has been shown to lead to a reduction in
computation. Another area of investigation which should be examined is the
use of conditional probabilities to bias the classification based on the
identity of the previous sample. A method for identifying specific fonts
needs to be determined; the alternative is a possible six-fold increase in
the number of parallel processors. It should be determined whether an
operator should enter a font type or just certain characteristics of each
batch run, e.g., journal name, which lend themselves to automatic identification.
An improved method for creating masks or training the machine when
new fonts are encountered needs to be devised. Certain character confusion
groups appear common to several fonts. More experimentation is needed to
substantiate this possibility. The alternative is that many fonts will
require some special logic, possibly designed manually, to resolve the
problem cases. Completely automatic design schemes remain a concept, not
yet proven in practice, when correlation logic is used.

This study was conducted in the context of an eventual hardware implementation.
The idea of parallel microprocessors to compute the necessary
correlation is essential. Specialized hardware is needed to improve the
throughput. The average forced recognition error rate of 0.5% is encouraging
considering the fact that the source material was not meant to be read
by OCR equipment. To verify an error rate of, say, 0.5% to a confidence
level of 95% requires the processing of about the number of characters
used on this project. However, if better recognition is achieved it will
be necessary to process many more characters to be confident that the
measured error rate will hold in practice. To use computer simulation to
process such a large amount of data, e.g., 10^[illegible] to 10^[illegible]
characters, would be uneconomical. It is recommended therefore that the next
step be to implement a complete hardware prototype, if it is necessary to
verify error rates of 0.1% or less. Before going to the expense of building a
complete prototype that will scan the printed page, isolate and classify
characters, it should be shown that it is probable that a 0.1% error rate can
be achieved.
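
The report does not say how the required sample size was obtained. One common way to reason about it, offered here purely as an assumption, is the normal approximation to the binomial, n = z^2 p (1 - p) / e^2, where p is the error rate, e is the tolerated half-width of the confidence interval, and z = 1.96 for 95% confidence. The tolerance is not given in the text; the 0.1 percentage point used below is hypothetical.

```python
import math


def sample_size(error_rate: float, half_width: float, z: float = 1.96) -> int:
    """Characters needed so the measured error rate lies within +/- half_width.

    Uses the normal approximation to the binomial; z = 1.96 corresponds to a
    two-sided 95% confidence level.
    """
    return math.ceil(z * z * error_rate * (1.0 - error_rate) / half_width ** 2)


# Hypothetical tolerance: 0.5% error rate known to within 0.1 percentage point.
n_half_percent = sample_size(0.005, 0.001)

# Same relative tolerance at a 0.1% error rate needs several times more data.
n_tenth_percent = sample_size(0.001, 0.0002)
```

Under these assumed tolerances the first figure is of the same order as the 17,611 characters processed, and the second is several times larger, which is consistent with the remark that lower error rates need many more characters to verify.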

This effort suggests that the unaided correlation scheme will not achieve
error rates less than 0.5% even at high resolution (40 micron spacing
between samples). It also showed that special logic that was specifically
designed and added to the basic scheme to resolve special sets of confusion
characters was quite effective in lowering the error rate on the data to
which it was applied. It is recommended that this approach be continued
and tested on more fonts.


APPENDIX A

SOFTWARE  DOCUMENTATION 


(Program descriptions and file structures developed for this project are
included  with  accompanying  flowcharts). 


Program Name: char_process

General Program Description:

This subroutine computes the correlation for the first pass
of the classification. The character dimensions are
multiplied by 1.14 and .86 to determine the range of dimensions
of possible masks to be used in the correlation. When the list of
masks has been generated, the correlation is then computed according
to

[equation not recoverable]

The normalized variable z has been computed by "transformer"
prior to execution of char_process. This process is repeated until
all mask candidates have been correlated. The program exits by
returning control to "correl_main".
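
The size presort described above can be sketched as follows. Only the 0.86 and 1.14 factors come from the text; the function name, the dictionary of mask sizes, and the choice to test height and width independently are assumptions.

```python
from typing import Dict, List, Tuple


def candidate_masks(sample_height: int, sample_width: int,
                    mask_sizes: Dict[str, Tuple[int, int]],
                    low: float = 0.86, high: float = 1.14) -> List[str]:
    """Keep the masks whose height and width both fall within the sample's range."""
    selected = []
    for label, (height, width) in mask_sizes.items():
        height_ok = low * sample_height <= height <= high * sample_height
        width_ok = low * sample_width <= width <= high * sample_width
        if height_ok and width_ok:
            selected.append(label)
    return selected
```

Only the masks returned here would go on to the correlation, which is how presorting both avoids the inclusion problem and cuts computation.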


Program Name: collect_patterns

General Program Description:

The function of "collect_patterns" is to gather various transformed
patterns for use in the creation of masks. The mask patterns are created
by executing "transformer" on the character files after they are isolated
by "docr". Each character has a three-word header which contains the
identification number of the character. In the Latin fonts, the identification
numbers are assigned sequentially with "a" having number 1 and "z"
having number 26. The list of characters to be used by "collect_patterns"
is generated by "mask_select". This list contains a data file name and
index number of each pattern to be used by "collect_patterns". The characters
are collected and placed in an output file labelled "mask_set" for use by
"mask_generator".


[Flowchart: collect_patterns]


Program Name: correl_main

General Program Description:

This  routine  coordinates  the  correlation  of  unknown  character  samples 
to  known  masks.  Initially,  the  dimensions  of  the  sample  are  compared  to 
the  masks  to  determine  which  masks  will  be  used  in  the  correlation. 

The  subroutine  "correlate"  computes  the  correlations  among  the  sample 
and  the  masks.  The  ratios  of  the  correlations/lowest  correlation  are  then 
calculated  to  determine  if  a second  stage  of  correlation  is  necessary.  If 
the  second  stage  is  necessary,  "more_processing"  is  called  to  compute  the 
correlations  among  the  sample  and  shifted  masks.  The  ratios  of  the  cor- 
relations/lowest correlation  are  checked  to  determine  if  the  third  stage  is 
necessary.  If  the  third  stage  is  necessary,  the  routine  "weight_gen"  creates 
the  weighted  difference  masks  and  "even_more_ processing"  calculates  the 
correlations.  The  sample  is  then  classified  or  rejected  based  on  the  reject 
ratio.  The  results  of  the  processed  data  sets  are  placed  in  output  files 
for  use  by  "summarize." 
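The stage decision described above can be sketched as follows. This is only an illustration: the report does not state the exact test, so the rule used here (proceed to the next stage when more than one mask has a ratio below a limit) and the limit of 1.4, which is read from a partly legible flowchart box, are assumptions, as are the function names.

```python
def ratios_to_lowest(correlations):
    """Divide every correlation value by the lowest one (the best match maps to 1.0)."""
    lowest = min(correlations.values())
    if lowest <= 0:
        raise ValueError("correlation values must be positive")
    return {mask: value / lowest for mask, value in correlations.items()}


def needs_next_stage(correlations, ratio_limit=1.4):
    """Return (go_on, candidates).

    candidates are the masks whose ratio is below ratio_limit, best first.
    go_on is True when more than one mask is that close to the best match
    (assumed rule; the limit 1.4 is an assumption as well).
    """
    ratios = ratios_to_lowest(correlations)
    close = [mask for mask, ratio in ratios.items() if ratio < ratio_limit]
    return len(close) > 1, sorted(close, key=ratios.get)
```

For example, `needs_next_stage({"a": 1.0, "b": 1.2, "c": 3.0})` keeps "a" and "b" as candidates for the next stage, while a clear winner ends the cascade.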
[Flowchart: correl_main. Compare sample and mask dimensions to determine
which masks will be used; compute ratios of correlations to lowest
correlation; return, or call "more_processing"; compute ratios of
correlations to lowest correlation; test the number of ratios below 1.4;
return, or call "weight_gen" to create weighted difference masks; compute
ratios of correlations to lowest correlation; return.]


Program Name: dcrim

General Program Description:

This routine is an alternative method of logic to be used to classify
the most error-prone Cyrillic characters, Н, И and П.

Initially, the pattern is subjected to a vertical histogram to isolate
the area between the two strong vertical segments contained in each of the
three characters. A horizontal histogram of the pixels located between the
two large vertical segments then provides insight into the identity of the
sample. If the number of non-zero elements is zero then the character is
П. If the number is greater than zero but less than seven, the identity
is Н, else the character is И.
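A minimal sketch of this histogram logic is given below. Several details are assumptions, not taken from the report: the pattern is a binary array (1 = ink), a "strong vertical segment" is any column that is at least 80% ink, and the three characters are read as Н, И and П. The report does not say how the top bar of П is kept out of the count, so the sketch counts every row in the gap and the caller is assumed to have removed such a bar.

```python
def column_sums(pattern):
    """Vertical histogram: amount of ink in each column."""
    return [sum(column) for column in zip(*pattern)]


def dcrim(pattern, stroke_fraction=0.8):
    """Classify a binary pattern as one of the letters Н, И, П (assumed reading)."""
    rows = len(pattern)
    sums = column_sums(pattern)
    strokes = [j for j, s in enumerate(sums) if s >= stroke_fraction * rows]
    if len(strokes) < 2:
        raise ValueError("two vertical segments not found")
    left = strokes[0]
    while left + 1 in strokes:      # walk to the right edge of the left segment
        left += 1
    right = strokes[-1]
    while right - 1 in strokes:     # walk to the left edge of the right segment
        right -= 1
    if left >= right:
        raise ValueError("vertical segments are not separated")
    gap = range(left + 1, right)
    # Horizontal histogram restricted to the gap: count rows containing any ink.
    nonzero_rows = sum(1 for row in pattern if any(row[j] for j in gap))
    if nonzero_rows == 0:
        return "П"
    if nonzero_rows < 7:
        return "Н"
    return "И"
```

The thresholds 0 and 7 are the ones quoted in the description; they presuppose patterns at the report's resolution.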


[Flowchart: dcrim. Obtain vertical histogram of sample; obtain horizontal
histogram of area between vertical segments; identify sample; return.]


Program Name: doer

General Program Description:

This  is  the  character  segmentation  program.  It  extracts  characters  from 
the  LIPS  image  and  places  them  into  output  files  for  use  in  mask  generation 
and  character  classification.  Prior  to  executing  doer,  the  program  ocr  will 
have  been  run  to  store  the  LIPS  images  as  disk  files. 

Initially, doer creates a grey level histogram from the input file for
use in the determination of the mean background value and the mean
character value. The data is then thresholded at a value estimated
to be at the stop of the background distribution. The thresholding is
essential for use in the line find algorithm.

A histogram in the horizontal or X direction is calculated and examined
for presence of text lines. If the width of any line is greater than 130
pixels, it is assumed to be two or more lines. This "line" is then segmented
into two lines and the widths of the resulting lines are examined. This
process is continued until no lines are greater than 130 pixels in width.
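The line find step can be sketched as below. The report does not say where an over-wide "line" is cut, so this sketch assumes the cut is made at the interior row with the smallest histogram value; the function name and the representation of a line as a (start, stop) pair of row indices are also assumptions.

```python
def find_lines(row_sums, max_width=130):
    """Return (start, stop) row ranges of text lines, stop exclusive.

    row_sums is the histogram of the thresholded image, one value per row.
    """
    runs, start = [], None
    for y, s in enumerate(row_sums):
        if s > 0 and start is None:
            start = y
        elif s == 0 and start is not None:
            runs.append((start, y))
            start = None
    if start is not None:
        runs.append((start, len(row_sums)))

    lines, stack = [], runs[::-1]
    while stack:
        a, b = stack.pop()
        if b - a <= max_width:
            lines.append((a, b))
            continue
        # Assumed rule: split the over-wide run at its weakest interior row.
        cut = min(range(a + 1, b - 1), key=lambda y: row_sums[y])
        stack.append((cut, b))
        stack.append((a, cut))
    return lines
```

Each split yields two strictly shorter pieces, so the loop ends once every piece is at most 130 pixels wide.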

Each line is then subjected to a word and character extraction procedure.
A vertical or Y histogram is performed and the column summations are stored
in an array. The start of a word is defined as that point where 12 of 15
array elements are greater than 0. The stop of a word is defined as the point
where 12 of 15 array elements are equal to 0. Each word is examined for
"characters", where the character set consists of printable characters, such
as the alphanumerics and punctuation, and non-printable characters, such as
ink smears and pencil smudges.
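One way to read the 12-of-15 rule is sketched here. It assumes the 15 elements are the window starting at the point being tested and looking forward, and that columns past the end of the line count as empty; neither detail is stated in the report, and the function name is hypothetical.

```python
def find_words(col_sums, window=15, needed=12):
    """Return (start, stop) column ranges of words, stop exclusive."""
    words, start = [], None
    n = len(col_sums)
    for x in range(n):
        inked = sum(1 for v in col_sums[x:x + window] if v > 0)
        blank = window - inked          # columns past the end count as blank
        if start is None:
            if col_sums[x] > 0 and inked >= needed:
                start = x               # at least 12 of the next 15 hold ink
        elif col_sums[x] == 0 and blank >= needed:
            words.append((start, x))    # at least 12 of the next 15 are empty
            start = None
    if start is not None:
        words.append((start, n))
    return words
```

With this reading a gap of a few empty columns between characters does not end the word, because fewer than 12 of the following 15 columns are empty.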

Experience has shown that no standard characters in the Cyrillic or
Latin alphabets exceed 90 pixels in width. Initially, the array representing
the word is examined, from the start, for a minimum over 90 pixels. The
distance from the starting location of the word to the first minimum is the
width of the first character. The 90 pixels after the first minimum are
examined for a minimum. This distance is the width of the second character.
This process continues until each word in a line is examined and the
coordinates for the width of the characters are determined. Given the
vertical coordinates of a line and the horizontal coordinate of a character,
the height of each character can be determined. When the x and y dimensions
of each character are known, the character is extracted from the input file
and placed into the output file. Along with each character, the font and the
date are extracted and the dimensions are stored for use by the
classification programs.
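The search for character boundaries can be sketched as follows. Two details are assumptions: a character starts at the first non-zero column after the previous minimum, and the end of the word is treated as an empty column so that the last character ends there. The function name is hypothetical.

```python
def split_characters(word_sums, max_char_width=90):
    """Return (start, stop) column ranges of characters in one word, stop exclusive."""
    n = len(word_sums)
    sums = list(word_sums) + [0]        # sentinel: the word ends in an empty column
    chars, pos = [], 0
    while True:
        while pos < n and sums[pos] == 0:
            pos += 1                    # skip to the first non-zero column
        if pos >= n:
            break
        window = range(pos + 1, min(pos + max_char_width, n) + 1)
        cut = min(window, key=lambda x: sums[x])   # first minimum within 90 columns
        chars.append((pos, cut))
        pos = cut
    return chars
```

Touching characters are separated at the lowest valley between them, while characters narrower than half the window that touch each other would be merged, which is a limitation of the rule as described.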


[Flowchart: doer. Obtain grey level histogram and determine average
background value; threshold the input file; create horizontal histogram
and do the line find; check dimensions of each line and solve line overlap
if necessary; perform vertical histogram for line; find start and stop of
the word in this line and set minimum location; find start of next
character as first non-zero pixel; examine the next 90 array elements for
minimum and stop of character.]

Program  Name:  even_more_processing 

General Program Description:

This routine calculates the weighted correlation between the input
character X and the mask M. The Fisher pairwise logic is used to select the
mask M as the id. of the input X in this routine. For K given masks,
$M_1, \ldots, M_K$, there are $K(K-1)/2$ comparisons: $M_1$ vs $M_2$, ...,
$M_1$ vs $M_K$, ..., $M_{K-1}$ vs $M_K$. The weights $w_i$ are calculated
based on the difference between the two masks involved, M and M'. For
instance, $w_i = |m_i - m'_i|$. Then the two weighted correlations

$c(X,M) = \frac{1}{N} \sum_i w_i (x_i - m_i)^2$  and  $c(X,M') = \frac{1}{N} \sum_i w_i (x_i - m'_i)^2$

are compared. The mask with the smaller correlation value will get an
additional "vote". After all the $K(K-1)/2$ comparisons are made, the mask
M with the most votes is assigned as the id. of the input character X.
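The pairwise voting can be sketched as below, assuming the sample and all masks have already been brought to flat arrays of equal length N, and assuming a tie in a comparison goes to the first mask of the pair. The function names are hypothetical.

```python
from itertools import combinations


def weighted_correlation(x, m, w):
    """(1/N) * sum of w_i * (x_i - m_i)^2 over all pixels."""
    return sum(wi * (xi - mi) ** 2 for wi, xi, mi in zip(w, x, m)) / len(x)


def pairwise_vote(x, masks):
    """Return (winner, votes) after all K(K-1)/2 pairwise comparisons.

    masks maps a mask id to its flat pixel array.
    """
    votes = {name: 0 for name in masks}
    for a, b in combinations(masks, 2):
        # Weights emphasise the pixels where the two masks differ.
        w = [abs(p - q) for p, q in zip(masks[a], masks[b])]
        ca = weighted_correlation(x, masks[a], w)
        cb = weighted_correlation(x, masks[b], w)
        votes[a if ca <= cb else b] += 1
    return max(votes, key=votes.get), votes
```

Because the weights vanish wherever two masks agree, each comparison is decided only by the pixels that distinguish that particular pair.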
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Program  Name:  line_stats 

General  Program  Description: 

The purpose of line_stats is to calculate the entropy E, the average
grey level $\bar{f}$, and the average segment length L in an image line.

After entering the N points $f_1, \ldots, f_N$ in a scan-line, the
point-counts $g_i$ of the 256 grey levels are calculated. The entropy

$E = -\frac{1}{N} \sum_{i=1}^{256} g_i \log g_i$

and the average grey level

$\bar{f} = \frac{1}{N} \sum_{i=1}^{256} i \, g_i$

are calculated.

The  input  line  is  rescanned  to  find  the  number  of  segments  C as  follows: 
if  f i ^ ^ f . 1 ^ ^i  1 ^i  ^ ^ ^ segment 

is  found.  The  average  segment  length  is  L = N/C. 




Program Name: mask_generator

General Program Description:

This program will average a specified number of patterns to form a
mask. The input data are the extracted character files and the output
data are masks stored in "mask_directory". The pattern centroids are used
for registering the patterns before averaging. There are four options
available: 1) isolated background spots may be deleted before averaging,
2) each mask may be distorted in two dimensions, 3) pixel values can be
adjusted, thus blacking out part of the mask, or 4) a threshold value may
be selected for the mask.

The specified number of patterns per mask are then extracted from the
input file "mask_set" and averaged to form the mask. This mask is then
subjected to the options listed above and is placed in the output file
"mask_directory".

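The registration and averaging step can be sketched as follows. It is an illustration only: it assumes background pixels are 0 and ink is positive, registers each pattern by shifting its centroid to the centre of a fixed-size output array with whole-pixel shifts, and omits the four options. The names are hypothetical.

```python
def centroid(pattern):
    """Intensity-weighted (row, column) centre of mass of a pattern."""
    total = sum(map(sum, pattern))
    r = sum(i * v for i, row in enumerate(pattern) for v in row) / total
    c = sum(j * v for row in pattern for j, v in enumerate(row)) / total
    return r, c


def average_mask(patterns, size):
    """Average patterns after moving each centroid to the centre of a size=(rows, cols) array."""
    rows, cols = size
    acc = [[0.0] * cols for _ in range(rows)]
    for p in patterns:
        r, c = centroid(p)
        dr = round((rows - 1) / 2 - r)      # whole-pixel registration shift
        dc = round((cols - 1) / 2 - c)
        for i, row in enumerate(p):
            for j, v in enumerate(row):
                ii, jj = i + dr, j + dc
                if 0 <= ii < rows and 0 <= jj < cols:
                    acc[ii][jj] += v        # pixels shifted off the array are dropped
    return [[v / len(patterns) for v in row] for row in acc]
```

With four patterns per symbol, as in the report's experiments, `patterns` would hold the four extracted examples of one character.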

[Flowchart: mask_generator. Obtain appropriate number of patterns per
mask; if background spots should be deleted, call "noise_clipper"; if the
mask should be thresholded, obtain threshold and set pixels to 0
accordingly; output mask to file "mask_directory"; return.]
Program Name: mask_select

General Program Description:

This routine generates the list of characters to be used in creating
masks by "collect_patterns" and "mask_generator". The identification
numbers of the characters are examined in order to randomly select, when
possible, 4 samples of each character. The file name and number of each
sample to be used are then output to a list for use by "collect_patterns".


[Flowchart: mask_select. Generate two random numbers, one corresponding
to an input file, the other to a character within that file; place file
name and character number in the output list.]

Program Name: more_processing

General Program Description:

This routine correlates the input sample against shifted masks in the
second stage of correlation. The correlation

$C(X,M) = \frac{1}{N} \sum_i (x_i - m_i)^2$

is computed using the 8 positions about the mask center-of-mass as center
points. The program exits by returning to "correl_main".
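A sketch of the shifted correlation is given below. It assumes the 8 positions are the one-pixel neighbours of the registered position, that sample and mask are arrays of the same size, and that pixels shifted in from outside take a fill value of 0; none of these details is spelled out in the description, and the names are hypothetical.

```python
def correlation(x, m):
    """Mean squared difference between two equally sized 2-D arrays."""
    n = len(x) * len(x[0])
    return sum((xv - mv) ** 2
               for xr, mr in zip(x, m) for xv, mv in zip(xr, mr)) / n


def shift(m, dr, dc, fill=0.0):
    """Move mask m by (dr, dc) pixels, padding vacated cells with fill."""
    rows, cols = len(m), len(m[0])
    out = [[fill] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            ii, jj = i + dr, j + dc
            if 0 <= ii < rows and 0 <= jj < cols:
                out[ii][jj] = m[i][j]
    return out


def best_shifted_correlation(x, m):
    """Lowest correlation over the 8 one-pixel shifts, with the shift that gave it."""
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    return min((correlation(x, shift(m, dr, dc)), (dr, dc)) for dr, dc in offsets)
```

A lower value means a better match, consistent with the minimum-distance decision used throughout the report.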




Program  Name:  noise_clipper 

General  Program  Description: 

This  subroutine  removes  "noise"  points  from  a character  pattern. 
A noise  point  is  defined  as  any  pixel  whose  grey-level  intensity 
is  above  the  character-final  threshold  but  which  does  not  have  at 
least  two  neighboring  pixels  which  are  above  that  threshold. 

All  noise  points  are  replaced  with  the  background  value  of  the 
character  pattern. 
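The noise point rule can be sketched as follows, assuming that "neighboring pixels" means the 8 surrounding pixels and that every pixel is tested against the original pattern rather than against partly cleaned data. The function signature is hypothetical.

```python
def noise_clipper(pattern, threshold, background):
    """Replace isolated above-threshold pixels with the background value."""
    rows, cols = len(pattern), len(pattern[0])
    out = [row[:] for row in pattern]
    for i in range(rows):
        for j in range(cols):
            if pattern[i][j] <= threshold:
                continue
            # Count the 8-neighbours that are also above the threshold.
            neighbours = sum(
                1
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
                if (di or dj)
                and 0 <= i + di < rows
                and 0 <= j + dj < cols
                and pattern[i + di][j + dj] > threshold
            )
            if neighbours < 2:
                out[i][j] = background
    return out
```

A pixel survives only when at least two of its neighbours are also above the threshold, so single specks and isolated pairs are removed while strokes are kept.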
[Flowchart: noise_clipper. Examine pixels and neighbors; if the pixel is
a noise point, replace it with the background value; repeat until all
pixels are processed; return.]
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Program  Name : ocr 

General  Program  Description: 

This routine converts the LIPS image files from 9-track magnetic tape
into MULTICS disk files for use by the character segmentation program, doer.

As each record of an image file is read into MULTICS, the last 2 bits
of each 8-bit byte are dropped for purposes of minimizing storage
requirements. Each image is stored under a user-supplied unique
5-character file name.
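The bit reduction amounts to a two-place shift, sketched here under the assumption that "the last 2 bits" are the two least significant bits of each byte, which leaves a 6-bit grey value (0 to 63). The function name is hypothetical.

```python
def drop_low_bits(record, bits=2):
    """Reduce each 8-bit grey value in a record to its upper (8 - bits) bits."""
    return [byte >> bits for byte in record]
```

For instance, a full-scale byte of 255 becomes 63 and any value below 4 becomes 0.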
Program Name: spot_remover

General Program Description:

This  subroutine  deletes  small  isolated  "spots"  from  a character 
pattern.  A spot  is  defined  as  a pattern  whose  total  mass  is  less 
than  5%  of  the  character's  mass  and  is  totally  disjoint  from  the 
pattern. 

A count is performed of the number of non-zero pixels in each
row and column. If there are any isolated rows and columns, the total
number of pixels contained in the area is summed. If this figure is
less than 5% of the total number of non-zero pixels, this area is
replaced with background values.
[Flowchart: spot_remover. Perform row and column histograms; calculate
mass of isolated spots; if the mass is less than 5% of the total mass,
replace the spot with the average background value; return.]
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Program  Name:  summarize 

General  Program  Description: 

This program summarizes the classification results that are stored in
the output files created by "correl_main". Each output file is examined and
statistics regarding number of each sample at each stage of correlation,
number of errors at each stage, percentage of errors, average number of
correlations per sample at each stage, average correlation value, total
number of samples, total number of errors, and the overall adjusted error
rate are presented.
Program Name: summarize_trade_off

General Program Description:

This routine is similar to "summarize" in function but also presents
information regarding error vs. rejection trade-off rates. The output
files created by "correl_main" are examined and statistics describing
number of each sample at each stage of correlation, number of errors at each
stage, percentage of errors, average correlation value, total number of
samples, total number of errors and adjusted error rate are presented.
In addition, the data files are analyzed to observe the effect of reject
ratios. These ratios are employed such that if the ratio of second lowest
correlation/lowest correlation does not exceed the reject ratio, the sample
is rejected rather than classified. This information relating the number of
samples, number of errors, number of rejects, and number of correctly
classified samples is presented for reject ratio rates varying from 1.00 to
1.30.
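The reject rule can be sketched as below. The treatment of a lowest correlation of exactly zero (classify unless the runner-up is zero as well) is an assumption added to avoid a division by zero; the function name is hypothetical.

```python
def classify_or_reject(correlations, reject_ratio):
    """Return the id of the best mask, or None when the sample is rejected.

    correlations maps mask ids to correlation values (lower is better) and
    must hold at least two entries.
    """
    ranked = sorted(correlations.items(), key=lambda item: item[1])
    (best, lowest), (_, second) = ranked[0], ranked[1]
    if lowest == 0:
        return best if second > 0 else None
    # Reject when the runner-up is not sufficiently worse than the winner.
    if second / lowest <= reject_ratio:
        return None
    return best
```

A reject ratio of 1.00 rejects only exact ties, so raising it toward 1.30 trades errors for rejects.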
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Program  Name:  transformer 

General  Program  Description: 

This routine converts an input pattern array of 6 bits to a grey-level
normalized pattern of 4 bits. The normalization formula is

$A'(i,j) = \frac{A(i,j) - \bar{A}}{\sigma_A}$

where A(i,j) is an element of the input pattern, $\bar{A}$ is the mean
grey level of the input pattern and $\sigma_A$ is the standard deviation of
the input pattern.

The output pattern is also shifted so that the centroid of the pattern
is located at the center.
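The normalization formula can be sketched as follows. The sketch assumes the population standard deviation is meant and leaves out both the requantization to 4 bits and the centroid shift, whose details the description does not give. The function name is hypothetical.

```python
from statistics import fmean, pstdev


def normalize(pattern):
    """Scale a 2-D grey-level pattern to zero mean and unit standard deviation."""
    values = [v for row in pattern for v in row]
    mean = fmean(values)
    sigma = pstdev(values)              # population standard deviation (assumed)
    if sigma == 0:
        raise ValueError("a constant pattern cannot be normalized")
    return [[(v - mean) / sigma for v in row] for row in pattern]
```

After this step a uniform change in reflectance or contrast of the input leaves the normalized pattern unchanged, which is the property the correlation scheme relies on.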
Program Name: weight_gen

General Program Description:

Calculate the weights based on the difference between two masks M, M'
by $w_i = |m_i - m'_i|$, $i = 1, \ldots, N$, where N is the dimension of the
larger of M and M'.

[Flowchart: weight_gen. Determine the dimension N of the virtual array,
which is the larger of the two masks M, M'.]


FILE  STRUCTURES 


An important part of the description of the software developed for this
project is the description of the file structures used by the programs.


File Type:   ID

Naming
Convention:  id.<transformed file name> or id.<DOCR file name>

Example:     id.c6t.c6001 or id.c6.c6001

Purpose:     The ID file is a list of the identities of the characters
             of a file. The identities contained in the file correspond
             1-to-1 with the similarly named transformed file of
             isolated characters. The ID file is used by MASK_SELECT
             and INSERT_IDS.

Format:      line 1      list

             lines 2...  a list of the ids of the characters in the
             corresponding transformed file; a zero ("0") is used
             wherever the corresponding character is not to be
             considered for classification (since such classification
             would be meaningless).


File Type:   DOCR

Naming
Convention:  <font name>.<name of original image file>

Example:     c6.c6001

Purpose:     The DOCR file (usually output by DOCR) contains
             several (1 or more) individual, isolated character
             images. DOCR files are also used for masks (output
             by MASK_GENERATOR) or collections of selected
             patterns (output by COLLECT_PATTERNS).

Format:      1 or more pattern structures

                 Pattern Structure 1
                 ...
                 Pattern Structure n

             where a pattern structure consists of a three-word
             header (36 bit words) followed by the pattern data
             array.

                 word 1:  font_id (bits 35-18), char_id (bits 17-0)
                 word 2:  month (bits 35-24), day (bits 23-12), year (bits 11-0)
                 word 3:  no_rows (bits 35-18), no_cols (bits 17-0)

             The header words are:

             font_id: Identity of the original document from which
             the data was taken.

             char_id: Identity of the character pattern, if known.

             month, day, year: The date of isolation of the sample.

             no_rows: The number of rows of data in the pattern.


             no_cols: The number of columns of data in the pattern.

             The pattern data array is laid out as no_rows lines of
             data. Each line of data is a whole number of MULTICS
             words (36 bits) packed with no_cols 6-bit intensity values
             and padded (on the right) with zeroes.


File Type:   TRANSFORMER

Naming
Convention:  <font name>t.<name of original image file>

Example:     c6t.c6001

Purpose:     The TRANSFORMER file (output by OUTPUT_MULTICS, a subroutine
             of TRANSFORMER) contains a collection of characters to be
             input by the correlation program CORREL_MAIN.

Format:

             1 or more pattern structures

                 Pattern Structure 1
                 ...
                 Pattern Structure n

             where a pattern structure consists of a 3 word header, a
             fourth word which provides the fill word, and a pattern
             array. The three-word header contains 12 seven-bit
             fields which represent:

             font_id: The identity of the document from which the
             pattern was extracted.

             char_id: The identity of the character contained in the
             pattern array.

             month, day, year: The date of pattern extraction.

             no_rows, no_cols: The virtual size of the pattern array.

             size: 1

             hi_row: The highest row coordinate of actual pattern
             data within the virtual array.

             lo_row: The lowest row coordinate of actual pattern data
             within the virtual array.

             hi_col, lo_col: The analogs of hi_row and lo_row with
             respect to the column coordinates.

             The fill word is that particular normalized intensity which
             is to be assigned to each value in the virtual array
             which is not in the actual array.
             The number of bits of normalized intensity is specified by
             the variable "precision" in the file "MASK_DIRECTORY".

             Pictorially the arrangement may be viewed as:

             [Figure not recoverable from the source.]
APPENDIX B


SINGLE CHARACTER PROCESSOR TIMING


To  better  understand  the  processing  time  requirements  for  the  character 
processors,  reference  should  be  made  to  Figure  6-1  for  an  overview  of  the 
hardware  configuration,  and  Figure  3-9  in  Section  3,  a chart  of  the  system 
data  flow. 

The  microprocessor  used  for  these  timings  has  a cycle  time  of  150ns. 
Addition  and  subtraction  of  sixteen  bit  words  requires  three  cycles  with 
multiplication  completed  in  two  cycles  using  a table  look-up  technique. 

The  character  processor  (CP)  system  timing  is  based  on  a 32-character 
font  with  the  characteristics  outlined  in  Figure  B-1. 

The  timing  is  primarily  concerned  with  the  arithmetic  function  of  the 
correlations  between  the  incoming  data  and  the  stored  character  masks,  with 
a 50%  factor  added  for  data  handling  overhead. 

To  facilitate  the  timing  calculations,  the  average  size  mask,  expressed 
in  pixels,  and  the  average  number  of  correlations  performed  per  character 
for  each  of  the  three  correlation  phases  were  computed.  When  calculating 
the  above  averages , the  frequency  of  occurrence  for  each  character  was  used 
as  a weighting  factor. 

The  algorithm  for  determining  the  average  mask  size  for  each  of  the 
three  phases  then  became: 

$AMS = \sum_{n=1}^{N} f(n) \cdot M(n) \cdot R(n) \cdot C(n)$

where

f(n) = Frequency of occurrence of the nth character (expressed
       as a fraction)
M(n) = Number of masks used (masks within ±15% of nth character size)
R(n) = Number of pixel rows
C(n) = Number of pixel columns
N    = Number of characters in font

The average number of correlations per data character:
[Table not recoverable from the source. Columns: Character; Frequency of
Occurrence in Russian Text; Size in Pixels (No. Rows, No. Columns); No. of
Masks used for Correlation I; No. of Masks used for Correlation II; No. of
Masks used for Correlation III.]

Figure B-1  Characteristics of Typical Font


$ANC = \sum_{n=1}^{N} f(n) \cdot M(n)$        Phase 1

$ANC = 4 \sum_{n=1}^{N} f(n) \cdot M(n)$      Phase 2

NOTE: The multiplier of 4 represents four correlations made on each
character with the centroid point shifted one pixel distance from
center in 4 positions 90° apart, performed on the average in
Phase 2.

The operations required to complete a character correlation consist of a
subtraction, an addition, and a multiplication for each pixel and one
division per character. These operations are derived from the correlation
algorithm (see Section 3).

The time to complete the correlations in each phase is expressed in the
following algorithm.

$(AMS \cdot T_P \cdot ANC \cdot 1.5) + 750 = T_C$

where

AMS = Average mask size
T_P = Calculation time for one pixel
ANC = Average number of correlations per character
1.5 = 50% data handling overhead
750 = Division time in nanoseconds
T_C = Correlation time per character in nanoseconds

Phase 1:  2840 x 1200 x 12.4 x 1.5 + 750 = 63,389,550 ns = 63.4 ms

Phase 2:  2875 x 1200 x 7.9 x 1.5 + 750 = 40,883,250 ns = 40.9 ms

Phase 3:  2897 x 1200 x 0.4 x 1.5 + 750 = 2,086,590 ns = 2.1 ms

Total correlation time per character = 106.4 ms
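The phase timings can be checked with a few lines of arithmetic. The per-pixel time of 1200 ns follows from the cycle counts given earlier (three cycles each for the subtraction and the addition, two for the multiplication, at 150 ns per cycle). The Phase 3 mask size is taken here as 2897, the value that reproduces the stated 2,086,590 ns; that reading is an assumption, as are the function and constant names.

```python
CYCLE_NS = 150
T_P = (3 + 3 + 2) * CYCLE_NS        # subtract + add + multiply per pixel = 1200 ns
DIVISION_NS = 750                   # one division per character


def correlation_time_ns(ams, anc, overhead=1.5):
    """T_C = AMS * T_P * ANC * overhead + division time, in nanoseconds."""
    return ams * T_P * anc * overhead + DIVISION_NS


phases = {1: (2840, 12.4), 2: (2875, 7.9), 3: (2897, 0.4)}
times = {phase: correlation_time_ns(ams, anc) for phase, (ams, anc) in phases.items()}
total_ms = sum(times.values()) / 1e6
```

Summing the three phases gives about 106.4 ms per character, in agreement with the total quoted above.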

Further  investigation  will  be  necessary  to  determine  the  exact  heuristic 
procedures  that  will  provide  the  optimum  results  in  Phase  3 of  the  character 
processing.  A pixel  weighting  scheme,  as  well  as  empirical  measurements 
for  individual  characters,  is  to  be  evaluated.  Final  selection  of  the 
Phase  3 procedure  could  result  in  a variance  of  as  much  as  minus  50%  to 
plus  100%  of  the  Phase  3 timing. 


APPENDIX  C 
DATA  SAMPLES 


The Cyrillic journals from which the data for the project was extracted are
listed below. In the instances where the identifying font number is the
same for both languages, such as L1 and C1, the Latin is the translation of
the Cyrillic journal and is located in the same journal.

Samples from each of the fonts are located below and contain both the
original chips as digitized by the LIPS system and the MULTICS
representation of the character.


Font : 

Cl, 

LI 

M3  TOJiorvr®  " 

C2, 

L2 

■■  iiyri'Taji 

C4, 

L4 

" ..“tvo  Han  MVTKo  o'-vojior  m?n  vjiern'njioTVv 

11 

MMMyHObMOnOr  I'M" 

L5 

C6 

" a.BTOua' 

riria  M TejiaidexaHiTKa  " 

CO 

o 

L8 

n 

TeoDCT  I'yei^HOH 



irrent  (1  • A>  was  paswd  I»n  tiH^aiUi  (*l  a bi 

microel(*(  trodf  msertt'd  i!i  a horizotital  tf!l  of  Ifu*  lurllt*  retina 
e retina  was  accompanied  by  an  imreasv  m resistanu*  of  the  ini< 
M)  This  ifUfcase  rrsulted  m a cbaruje  o!  the  celt  Ivimlb  resp« 
that  the  changes  in  resistance  are  h 'alized  not  in  the  cell  mer 
:ell  near  the  rnicn>electrode  tip  The  effect  described  can  be 
current  IhrouKh  the  second  barrel  of  a double-barreled  microeh 
had  to  be  several  times  stronijer.  than  m passing;  through  th 
n the  membrane  potential  of  a liorizontal  cell  was  shifted  f 
potential  (by  means  of  a constant  current  passed  IhrouKh  the  i 
the  effect  of  the  microelectrode  resistance  increast*  during 
simultaneously  with  hyperpoiarization  light  response  itself.  Or 
■xlrinsic  hyperpoiarization  of  the  cell  membrane  was  acrojnpai 
1 depolarization  — b\  a decrease  in  the  microelectrode  resistance 
he  effect  found  can  be  explained  by  the  "dosing  up*'  of  the  rm 
intracellular  structure,  the  resistance  of  which  is  a function  of  t 


yicyciaseari(jproduces 


Font L1 Sample Data


■istics  of  evoked  oetivity  to  li(;lit  were  studied  in  the  vi- 
rehral  licniispheres  in  iwent\  three  liealthy  .Mihjeets 
reactivity  asymmetry  were  found.  The  first  is  predomin 
at  hinocular  stimulation  and  is  linked  witli  a gre; 
illations  of  P4  and  of  sensory  alpha-a.lerdiseharge  of 
in  the  suhdominant  hemisphere.  The  second  t\pe  of  as 
■I  cases  of  monocular  stimulation.  It  consists  in  a bilat 
oscillations  (sometimes  the  P2-02  comple.v  is  , 
niulation  (jf  the  right  e\e.  This  effect  is  related  to  the  i 
thetical  reinforcing  retino-|)res(riate  pathway  which  a| 
he  structures  of  the  temporal  lobe  the  rigid  (suhdi 
e as  its  major  link. 


eaiedintheunitatio 


Font  L2  Sample  Data 


Activation  of  allergic  focal  reaction  under  the  effect  of  spi 
pigs  with  dysentery  keratoconjunctivitis  was  accompanied  by  ■ 
dular  and  motor  apparatuses  of  the  intestine,  and  disturbance  of 
Conjunctival  application  of  a complete  Sonnei  antigen  and  Troitst 
se  of  alkaline  phosphatase  secretion  (1.34 — 1. 39- fold),  and  of  ente 
Patients  with  acute  and  chronic  enterocolitis,  acute  anr 
marked  functional  disturbances  of  the  glandular  apparatus  and 
vitv  of  the  intestine  reflecting  the  pathogenetic  regularities  in 
fection  of  the  intestinal  tissue  of  specific  and  nonspecific  or 
Indices  of  enteropeptidase  and  alkaline  phosphatase  acti 
could  serve  as  reliable  diagnostic  criteria  of  functional  disturb 
and  chronic  inflammatory  diseases  of  the  intestine  of  infectioi 
and.  along  with  results  of  other  methods  of  investig-stion,  aide* 
sentery  from  carrier  state. 


xicsubph^gocytosisandp 


Font  L4  Sample  Data 


A study  has  bi.en  made  of  the  structure  of  the  blends 
amide  and  polyethylene  with  polyamide  AK  60/40  form 
of  unslabilized  and  commercial  polymers.  Depending  c 
a Ii'cl:,  It  is  possible  to  oblaiii  solid  dispersions  polymer 
ture  or  disperse  systems  of  interpenetrating  macroneti 
Small  amounts  of  surfactants  favor  the  formation  of  fil 
formation  accompanied  by  «neckmg»  breaks  up  into  separa' 


ynthesisofcaicjumionk 


Font  L5  Sample  Data 
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In'>t.it>ili)y  of  ini  cl<><  troll  It'Min  in  ;i  low  < oio  eiitratii 
oxpc-mui'fitiilly.  A '•(Mviai  fi-alnn-  of  llio  rxpcnnifn 
iM-iini  111  ’ (in),  an  i'xIm'Iiu'  romiilion  foi 

lion  ot  |.i.('-ma  iii'lalolilto  of  tltr  diift  lypc.  The  «anou 
menon,  \shicli  aiipcar**  on  ih  vrlopint'nt  of  instaliility,  is 
(•xjMTiiiiciital  iv*«nll’<  \sith  Ific  tlieory  shows  that,  under 
vostiyiiU  il,  ex.ipc  of  ioji''  from  llic  Ia*ani  is  due  to  their 
field  hy  axial-non^ymmclric  i-Ict  tron-ion  oseillalions;  th 
ronse<'uli\c  kinetic  and  Itydrodynaniic  build  up  of  drift 


Font  L8  Sample  Data 


iTHoro  Hepsa,  no  BiuHMOMy,  Her  o6mnx  CHHan- 
rHwecacafl  aenpeccHH  OTcyrcTByeT.  rpa(J>HKH,  no- 
cpejHHX  jaHiibix,  no.iyMeHHbix  b sthx  onuTax, 

owHbix  orpaweiiUbix  b.thbhhh  co  CTopoHbi  buc- 
HcpBHoft  CHCTeMbi  iipH  pasApaweHHH  ;iopcajib- 
flHHHHH,  KOTOpbie  MOryT  BOaHHKHyib  BCJieACTBHC 
aiomero  Toxa  Ha  npoTHBonojio>KHyK)  ciopony  co 
oro  Aopca;ibHoro  xaHaTHKa,  psA  onuTOB  npoBe- 
X JKHBOTHbIX  C HpilMeHCHHeM  FAyCOKOfl  AByCTO- 
►Hbix  xanaTHKOB  HH>Ke  H Bbiuie  Mecrra  pasApa- 
MCHeHHH  niin,  BbiaBaHHhix  Hpo6HbiMH  pasApa- 
aAHCb  OT  Tex,  KOTOpbie  6bIAH  oSHapyHteHbl  a pH 
aopcaAbHoro  ctoa63. 

■nbITOB  HCC.ieA'>BaAOCb  BAHHHHe  npCABapHTeAbHO- 
aro  KanaTHKa  na  BeAHHHHy  aHTHApoMHWx  otbc- 
1H6M  TepMHHaAefl,  bxoahiuhx  b cerMCHT  L?  nep- 
:TpHpyeMbix  b a({)4)epeHTHOM  nepBe  hah  (})HAa- 

r 1 r\i 


TeHCHBHOCTHXnpHBO, 


Font C1 Sample Data
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\it/h.iy  i.iiMoiiiJMii  ii[)ouf(.ca.vii. 

oit;ii(HH.x  K.  AOy.ia.ue  [I,  2]  iioKaaaiio,  mto  CKpuTU 
I (.laTfin lujf  o'laiH)  Moiyr  iipoHn.'iHTb  AOMniianTnue  c 
iKj  K HMOHi.  HO jiiiiKanjinuMy  uoaGy/K/ieiiHio.  Cjicaob. 
uj,  ycii.iiinaHcii  aa  cict  tipHxo.iHiuero  BoaCy/KAeiiHH,  c 
Ha  xapaKTC'p  iipoTeKamiH  noHeacM'iecKofi  peaKU.nn.  Bot 
11.,  cHnta.iH3HpyKMUHH  of)opomne.nL.Hyio  peaKUHK),  np 
aBepmeiiHH  imineBoro  pei}j.ieKca,  BuiauBaeT  oaHoapeMe 
iHo  T3K  >Kc  HHiueBOH  ciiriia.i,  HcribiTaniu.iH  iioc.ie  o6op( 
ca,  conpoBO/K.iaercH  SiiiiapiibiM  poc[);ieKCo.M. 

. >iTo  Iioc.ie  .leiicTUHH  cn;ibHoro  pa,aApa>KHTe.iH,  xaKO 
HH  TDK,  OCo6ciUIO  .UIITC.'IDIU  COXpailHCTCH  C.ie.l  B036\ 

o xapaKiepa  (3].  IIoaxo.My  B036y>K.'ieHHe,  BosiiMKaioui 
menoro  airiia.ia,  mo/Kct  iio.aHocibio  oiB.ieKaTbCH  b ct 
ibiio  npH.Meiieiiiioro  o6opoHHTe.ibHoro  pe^i.icKca.  Bnei 
noAaBjieHHCM  h.ih,  Bepnee,  iieiipoHB.ieHneM  nimieBOH  j 
D Mepe  noBTopeiiHH  iiiimeBoro  ciiriia.ia  c iioAKpeiiAeHHe 
iBbiuieniiH  B036y,iH.M0CTH  B cipyKTypax  iiiiiueBoro  pi 

a 'ifip  1 T 'loniiijp  npaKiiiui  a •jutavi  nnonr-voTHT  M OTHHihf' 


Font  C2  Sample  Data 


[Figure: scanned sample of printed Cyrillic text; not recoverable as text]


Font C4 Sample Data
[Figure: scanned sample of printed Cyrillic text; not recoverable as text]
[Figure: scanned sample of printed Cyrillic text; not recoverable as text]
MISSION

of 

Rome  Air  Development  Center 


RADC plans and conducts research, exploratory and advanced
development programs in command, control, and communications
(C³) activities, and in the areas of information sciences
and intelligence. The principal technical mission areas
are communications, electromagnetic guidance and control,
surveillance of ground and aerospace objects, intelligence
data collection and handling, information system technology,
ionospheric propagation, solid state sciences, microwave
physics and electronic reliability, maintainability and
compatibility.